Abstract

We  determine  what  information  about  failures  is  necessary  and  sufficient  to 
solve  Consensus  in  asynchronous  distributed  systems  subject  to  crash  failures.  In 
[CT91], we proved that ◇W, a failure detector that provides surprisingly little
information about which processes have crashed, is sufficient to solve Consensus
in asynchronous systems with a majority of correct processes. In this paper, we
prove that to solve Consensus, any failure detector has to provide at least as much
information as ◇W. Thus, ◇W is indeed the weakest failure detector for solving
Consensus  in  asynchronous  systems  with  a  majority  of  correct  processes. 


1  Introduction 

1.1  Background 

The asynchronous model of distributed computing has been extensively studied. Informally,
an asynchronous distributed system is one in which message transmission times
and  relative  processor  speeds  are  both  unbounded.  Thus  an  algorithm  designed  for 
an  asynchronous  system  does  not  rely  on  such  bounds  for  its  correctness.  In  practice, 
asynchrony  is  introduced  by  unpredictable  loads  on  the  system. 

Although the asynchronous model of computation is attractive for the reasons outlined
above, it is well known that many fundamental problems of fault-tolerant distributed
computing that are solvable in synchronous systems are unsolvable in asynchronous
systems. In particular, it is well known that Consensus, and several forms
of reliable broadcast, including Atomic Broadcast, cannot be solved deterministically in
an asynchronous system that is subject to even a single crash failure [FLP85,DDS87].
Essentially,  these  impossibility  results  stem  from  the  inherent  difficulty  of  determining 
whether  a  process  has  actually  crashed  or  is  only  “very  slow”. 

To circumvent these impossibility results, previous research focused on the use of randomization
techniques [CD89], the definition of some weaker problems and their solutions
[DLP+86,ABD+87,BW87,BMZ88],  or  the  study  of  several  models  of  partial  synchrony 
[DDS87,DLS88].  However,  the  impossibility  of  deterministic  solutions  to  many  agreement 
problems  (such  as  Consensus  and  Atomic  Broadcast)  remains  a  major  obstacle  to  the 
use  of  the  asynchronous  model  of  computation  for  fault-tolerant  distributed  computing. 

An  alternative  approach  to  circumvent  such  impossibility  results  is  to  augment  the 
asynchronous  model  of  computation  with  a  failure  detector.  Informally,  a  failure  detector 
is  a  distributed  oracle  that  gives  (possibly  incorrect)  hints  about  which  processes  may 
have  crashed  so  far:  Each  process  has  access  to  a  local  failure  detector  module  that 
monitors  other  processes  in  the  system,  and  maintains  a  list  of  those  that  it  currently 
suspects  to  have  crashed.  Each  process  periodically  consults  its  failure  detector  module, 
and  uses  the  list  of  suspects  returned  in  solving  Consensus. 

A  failure  detector  module  can  make  mistakes  by  erroneously  adding  processes  to  its 
list  of  suspects:  i.e.,  it  can  suspect  that  a  process  p  has  crashed  even  though  p  is  still 
running.  If  it  later  believes  that  suspecting  p  was  a  mistake,  it  can  remove  p  from  its  list. 
Thus,  each  module  may  repeatedly  add  and  remove  processes  from  its  list  of  suspects. 
Furthermore,  at  any  given  time  the  failure  detector  modules  at  two  different  processes 
may  have  different  lists  of  suspects. 

It  is  important  to  note  that  the  mistakes  made  by  a  failure  detector  should  not  prevent 
any  correct  process  from  behaving  according  to  specification.  For  example,  consider  an 
algorithm  that  uses  a  failure  detector  to  solve  Atomic  Broadcast  in  an  asynchronous 
system.  Suppose  all  the  failure  detector  modules  wrongly  (and  permanently)  suspect 
that  a  correct  process  p  has  crashed.  The  Atomic  Broadcast  algorithm  must  still  ensure 
that  p  delivers  the  same  set  of  messages,  in  the  same  order,  as  all  the  other  correct 
processes. Furthermore, if p broadcasts a message m, all correct processes must deliver
m.¹

In [CT91], we showed that a surprisingly weak failure detector is sufficient to solve
Consensus and Atomic Broadcast in asynchronous systems with a majority of correct
processes. This failure detector, called the eventually weak failure detector and denoted
W here, satisfies only the following two properties:²

1. There is a time after which every process that crashes is always suspected by some
correct process.

2. There is a time after which some correct process is never suspected by any correct
process.

¹ A different approach was taken in [RB91]: a correct process that is wrongly suspected to have
crashed voluntarily leaves the system. It may later rejoin the system by assuming a new identity.

² In [CT91], this was denoted ◇W.

Note  that,  at  any  given  time  t,  processes  cannot  use  W  to  determine  the  identity  of  a 
correct  process.  Furthermore,  they  cannot  determine  whether  there  is  a  correct  process 
that  will  not  be  suspected  after  time  t. 

The  failure  detector  W  can  make  an  infinite  number  of  mistakes.  In  fact,  it  can  forever 
add  and  then  remove  some  correct  processes  from  the  lists  of  suspects  (this  reflects 
the  inherent  difficulty  of  determining  whether  a  process  is  just  slow  or  has  crashed). 
Moreover,  some  correct  processes  may  be  erroneously  suspected  to  have  crashed  by  all 
the  other  processes  throughout  the  entire  execution. 

The two properties of W state that eventually something must hold forever; this
may appear too strong a requirement to implement in practice. However, when solving a
problem that “terminates”, such as Consensus, it is not really required that the properties
hold forever, but merely that they hold for a sufficiently long time, i.e., long enough for
the algorithm that uses the failure detector to achieve its goal. For instance, in practice
the algorithm of [CT91] that solves Consensus using W only needs the two properties of
W to hold for a relatively short period of time.³ However, in an asynchronous system it
is not possible to quantify “sufficiently long”, since even a single process step or a single
message transmission is allowed to take an arbitrarily long amount of time. Thus it is
convenient to state the properties of W in the stronger form given above.

1.2  The  problem 

The failure detection properties of W are sufficient to solve Consensus in asynchronous
systems. But are they necessary? For example, consider failure detector A that satisfies
Property 1 of W and the following weakening of Property 2:

There is a time after which some correct process is never suspected by at
least 99% of the correct processes.

A is clearly weaker than W. Is it possible to solve Consensus using A? Indeed, what
is the weakest failure detector sufficient to solve Consensus in asynchronous systems?
In trying to answer this fundamental question we run into a problem. Consider failure
detector B that satisfies the following two properties:

1.  There  is  a  time  after  which  every  process  that  crashes  is  always  suspected  by  all 
correct  processes. 

2.  There  is  a  time  after  which  some  correct  process  is  never  suspected  by  a  majority 
of  the  processes. 

³ In that algorithm processes are cyclically elected as “coordinators”. Consensus is achieved as soon
as a correct coordinator is reached, and no process suspects it to have crashed while this coordinator is
trying to enforce consensus.

It seems that B and W are incomparable: B’s first property is stronger than W’s, and B’s
second  property  is  weaker  than  W’s.  Is  it  possible  to  solve  Consensus  in  an  asynchronous 
system  using  B?  The  answer  turns  out  to  be  “yes”  (provided  that  this  asynchronous 
system  has  a  majority  of  correct  processes,  as  W  also  requires).  Since  W  and  B  appear 
to  be  incomparable,  one  may  be  tempted  to  conclude  that  W  cannot  be  the  “weakest” 
failure  detector  with  which  Consensus  is  solvable.  Even  worse,  it  raises  the  possibility 
that  no  such  “weakest”  failure  detector  exists. 

However, a closer examination reveals that B and W are indeed comparable in a
natural way: There is a distributed algorithm $T_{B \to W}$ that can transform B into a failure
detector with the Properties 1 and 2 of W. $T_{B \to W}$ works for any asynchronous system
that has a majority of correct processes. We say that W is reducible to B in such a
system. Since $T_{B \to W}$ is able to transform B into W in an asynchronous system, B must
provide at least as much information about process failures as W does. Intuitively, B is
at least as strong as W.

1.3  The  result 

In [CT91], we showed that W is sufficient to solve Consensus in asynchronous systems
if and only if n > 2f (where n is the total number of processes, and f is the maximum
number of processes that may crash). In this paper, we prove that W is reducible
to any failure detector D that can be used to solve Consensus (this result holds for
any asynchronous system). We show this reduction by giving a distributed algorithm
$T_{D \to W}$ that transforms any such D into W. Therefore, W is indeed the weakest failure
detector that can be used to solve Consensus in asynchronous systems with n > 2f.
Furthermore, if n ≤ 2f, any failure detector that can be used to solve Consensus must
be strictly stronger than W.

The task of transforming any given failure detector D (that can be used to solve
Consensus) into W runs into a serious technical difficulty for the following reasons:

• To strengthen our result, we do not restrict the output of D to lists of suspects.
Instead, this output can be any value that encodes some information about failures.
For example, a failure detector D should be allowed to output any boolean formula,
such as “(not p) and (q or r)” (i.e., p is up and either q or r has crashed), or any
encoding of such a formula. Indeed, the output of D could be an arbitrarily complex
(and unknown) encoding of failure information. Our transformation from D into
W must be able to decode this information.

• Even if the failure information provided by D is not encoded, it is not clear how to
extract from it the failure detection properties of W. Consequently, if D is given
in isolation, the task of transforming it into W may not be possible.

Fortunately, since D can be used to solve Consensus, there is a corresponding algorithm,
$\mathit{Consensus}_D$, that is somehow able to “decode” the information about failures provided
by D, and knows how to use it to solve Consensus. Our reduction algorithm, $T_{D \to W}$, uses
$\mathit{Consensus}_D$ to extract this information from D and transforms it into the properties of
W.

2  The  model 

We  describe  a  model  of  asynchronous  computation  with  failure  detection  patterned  after 
the  one  in  [FLP85]. 

2.1  Failure  Detectors 

We  assume  the  existence  of  a  discrete  global  clock  to  simplify  the  presentation.  This  is 
merely  a  fictional  device:  the  processes  do  not  have  access  to  it.  We  take  the  range  T  of 
the  clock’s  ticks  to  be  the  set  of  natural  numbers. 

The system consists of a set of n processes, $\Pi = \{p_1, p_2, \ldots, p_n\}$, that may fail by
crashing. A failure pattern F is a function from T to $2^\Pi$, where F(t) denotes the set
of processes that have crashed through time t. Once a process crashes, it does not
“recover”, i.e., $\forall t : F(t) \subseteq F(t+1)$. We define $\mathit{crashed}(F) = \bigcup_{t \in T} F(t)$ and
$\mathit{correct}(F) = \Pi - \mathit{crashed}(F)$. If $p \in \mathit{crashed}(F)$ we say p crashes in F and if
$p \in \mathit{correct}(F)$ we say p is correct in F.
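The definitions of F(t), crashed(F) and correct(F) can be made concrete with a small sketch. It assumes a crash-only pattern stored as a map from each process to its crash time; the class name `FailurePattern` and the field `crash_time` are hypothetical and do not appear in the paper.

```python
from typing import Dict, FrozenSet, Optional

Process = str


class FailurePattern:
    """Crash-only failure pattern over the discrete clock 0, 1, 2, ...

    Hypothetical representation: crash_time[p] is the first tick at which p
    counts as crashed, or None if p never crashes.
    """

    def __init__(self, crash_time: Dict[Process, Optional[int]]):
        self.crash_time = dict(crash_time)
        self.processes: FrozenSet[Process] = frozenset(crash_time)

    def F(self, t: int) -> FrozenSet[Process]:
        """Set of processes that have crashed through time t."""
        return frozenset(
            p for p, c in self.crash_time.items() if c is not None and c <= t
        )

    def crashed(self) -> FrozenSet[Process]:
        """Union of F(t) over all t: every process that ever crashes."""
        return frozenset(p for p, c in self.crash_time.items() if c is not None)

    def correct(self) -> FrozenSet[Process]:
        """Pi minus crashed(F)."""
        return self.processes - self.crashed()
```

Because a crash time is fixed once and for all, F(t) can only grow with t, which matches the requirement that crashed processes do not recover.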

Associated with each failure detector is a range $\mathcal{R}$ of values output by that failure
detector. A failure detector history H with range $\mathcal{R}$ is a function from $\Pi \times T$ to $\mathcal{R}$. H(p, t)
is the value of the failure detector module of process p at time t. A failure detector D
is a function that maps each failure pattern F to a set of failure detector histories with
range $\mathcal{R}_D$ (where $\mathcal{R}_D$ denotes the range of failure detector outputs of D). D(F) denotes
the set of possible failure detector histories permitted by D for the failure pattern F.

For example, consider the failure detector W mentioned in the introduction. Each
failure detector module of W outputs a set of processes that are suspected to have
crashed: in this case $\mathcal{R}_W = 2^\Pi$. For each failure pattern F, W(F) is the set of all failure
detector histories $H_W$ with range $\mathcal{R}_W$ that satisfy the following properties:

1. There is a time after which every process that crashes in F is always suspected by
some process that is correct in F:

$$\exists t \in T,\ \forall p \in \mathit{crashed}(F),\ \exists q \in \mathit{correct}(F),\ \forall t' > t : p \in H_W(q, t')$$

2. There is a time after which some process that is correct in F is never suspected by
any process that is correct in F:

$$\exists t \in T,\ \exists p \in \mathit{correct}(F),\ \forall q \in \mathit{correct}(F),\ \forall t' > t : p \notin H_W(q, t')$$

Note that we specify a failure detector D as a function of the failure pattern F of
an execution. However, this does not preclude an implementation of D from using other
aspects of the execution such as when messages are received. Thus, executions with the
same failure pattern F may still have different failure detector histories. It is for this
reason that we allow D(F) to be a set of failure detector histories from which the actual
failure detector history for a particular execution is selected non-deterministically.
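As an illustration of the two properties of W stated above, the following sketch evaluates them on a history restricted to a finite horizon. This is only an approximation: the properties quantify over all future times, so a finite check can refute nothing about the infinite history and is meant purely to show the quantifier structure. The function names and the `horizon` parameter are assumptions made for this example.

```python
from typing import Callable, FrozenSet, Set

Process = str
# H(q, t): the set of processes that q's module suspects at time t.
SuspectHistory = Callable[[Process, int], FrozenSet[Process]]


def property_1(H: SuspectHistory, crashed: Set[Process],
               correct: Set[Process], horizon: int) -> bool:
    """Exists t such that every crashed p is suspected by some correct q
    at every time t' with t < t' <= horizon."""
    return any(
        all(
            any(all(p in H(q, u) for u in range(t + 1, horizon + 1))
                for q in correct)
            for p in crashed
        )
        for t in range(horizon)
    )


def property_2(H: SuspectHistory, correct: Set[Process], horizon: int) -> bool:
    """Exists t and a correct p such that no correct q suspects p
    at any time t' with t < t' <= horizon."""
    return any(
        any(
            all(p not in H(q, u)
                for q in correct
                for u in range(t + 1, horizon + 1))
            for p in correct
        )
        for t in range(horizon)
    )
```

Note that in Property 1 the suspecting process q may differ for each crashed p, while in Property 2 a single p must go unsuspected by all correct processes; the nesting of `any` and `all` mirrors that order of quantifiers.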

2.2  Algorithms 

We  model  the  asynchronous  communication  channels  as  a  message  buffer  which  contains 
messages of the form (p, data, q) indicating that process p has sent data addressed to
process  q  and  q  has  not  yet  received  that  message.  An  algorithm  A  is  a  collection  of 
n  (possibly  infinite  state)  deterministic  automata,  one  for  each  of  the  processes.  A(p) 
denotes  the  automaton  running  on  process  p.  Computation  proceeds  in  steps  of  the  given 
algorithm  A.  In  each  step  of  A,  process  p  performs  atomically  the  following  three  phases: 

Receive phase: p receives a single message of the form (q, data, p) from the message
buffer, or a “null” message, denoted λ, meaning that no message is received by p
during this step.

Failure  detector  query  phase:  p  queries  and  receives  a  value  from  its  failure  detector 
module.  We  say  that  p  sees  a  value  d  when  the  value  returned  by  p's  failure  detector 
module  is  d. 

Send  phase:  p  changes  its  state  and  sends  a  message  to  all  the  processes  according  to 
the  automaton  A(p),  based  on  its  state  at  the  beginning  of  the  step,  the  message 
received  in  the  receive  phase,  and  the  value  that  p  sees  in  the  failure  detector  query 
phase.⁴

The message actually received by the process p in the receive phase is chosen non-deterministically
from amongst the messages in the message buffer destined to p, and
the null message λ. The null message may be received even if there are messages in the
message buffer that are destined to p: the fact that m is in the message buffer merely
indicates that m was sent to p. Since ours will be a model of asynchronous systems,
where messages may experience arbitrary (but finite) delays, the amount of time m
may remain in the message buffer before it is received is unbounded. Indeed, our
model will allow a message sent later than another to be received earlier than the other.
Though message delays are arbitrary, we also want them to be finite. We model this
by introducing a liveness assumption: every message sent will eventually be received,
provided its recipient makes “sufficiently many” attempts to receive messages. All this
will be made more precise later.

⁴ In the send phase, p sends a message to all the processes atomically. As was shown in [FLP85],
the ability to do so is not sufficient for solving Consensus. An alternative formulation of a step could
restrict a process to sending a message to a single process in the send phase. We can show that both
formulations are equivalent for our purposes.

We  also  remark  that  the  non-determinism  arising  from  the  choice  of  the  message  to  be 
received  reflects  the  asynchrony  of  the  message  buffer  —  it  is  not  due  to  non-deterministic 
choices  made  by  the  process.  The  automaton  A(p)  is  deterministic  in  the  sense  that  the 
message  that  p  sends  in  a  step  and  p’s  new  state  are  uniquely  determined  from  the 
present  state  of  p,  the  message  p  received  during  the  step  and  the  failure  detector  value 
seen  by  p  during  the  step. 

To  keep  things  simple  we  assume  that  a  process  p  sends  a  message  m  to  q  at  most 
once.  This  allows  us  to  speak  of  the  contents  of  the  message  buffer  as  a  set,  rather  than 
a  multiset.  We  can  easily  enforce  this  by  adding  a  counter  to  each  message  sent  by  p  to 
q  —  so  this  assumption  does  not  damage  generality. 

2.3  Configurations,  Runs  and  Environments 

A configuration is a pair (s, M), where s is a function mapping each process p to its local
state, and M is a set of triples of the form (q, data, p) representing the messages presently
in the message buffer. An initial configuration of an algorithm A is a configuration (s, M),
where s(p) is an initial state of A(p) and $M = \emptyset$. A step of a given algorithm A transforms
one configuration to another. A step of A is uniquely determined by the identity of the
process p that takes the step, the message m received by p during that step, and the
failure detector value d seen by p during the step. Thus, we identify a step of A with a
tuple (p, m, d, A). If the message received in that step is the null message, then m = λ,
otherwise m is of the type (-, -, p). We say that a step e = (p, m, d, A) is applicable to
a configuration C = (s, M) if and only if $m \in M \cup \{\lambda\}$. We write e(C) to denote the
unique configuration that results when e is applied to C.
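The notions of configuration, step and applicability can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's formalism: the automaton is modelled as a Python function from (state, received message, detector value) to (new state, data to broadcast), the null message λ is represented by `None`, and returning `None` as data (meaning "send nothing") is a convenience added here. All names are hypothetical.

```python
from typing import Any, Callable, Dict, FrozenSet, Iterable, Optional, Tuple

Process = str
Message = Tuple[Process, Any, Process]                   # (sender, data, receiver)
Config = Tuple[Dict[Process, Any], FrozenSet[Message]]   # (s, M)
Step = Tuple[Process, Optional[Message], Any]            # (p, m, d)
LAMBDA = None                                            # the null message

# Assumed automaton shape: deterministic in (state, message, detector value).
Automaton = Callable[[Any, Optional[Message], Any], Tuple[Any, Any]]


def applicable(step: Step, config: Config) -> bool:
    """A step is applicable iff m is the null message or a buffered
    message addressed to the process taking the step."""
    p, m, _ = step
    _, M = config
    return m is LAMBDA or (m in M and m[2] == p)


def apply_step(step: Step, config: Config,
               automata: Dict[Process, Automaton],
               processes: Iterable[Process]) -> Config:
    """Return e(C): receive m, use detector value d, change state, send to all."""
    if not applicable(step, config):
        raise ValueError("step is not applicable to this configuration")
    p, m, d = step
    s, M = config
    new_state, data = automata[p](s[p], m, d)
    new_s = dict(s)
    new_s[p] = new_state
    new_M = set(M)
    if m is not LAMBDA:
        new_M.discard(m)                                  # m leaves the buffer
    if data is not None:
        new_M |= {(p, data, q) for q in processes}        # atomic send to all
    return new_s, frozenset(new_M)
```

Keeping the buffer as a set relies on the assumption, made in Section 2.2, that a process sends a given message to a given recipient at most once.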

A schedule S of algorithm A is a finite or infinite sequence of steps of A. $S_\perp$ denotes
the empty schedule. We say that a schedule S of an algorithm A is applicable to a
configuration C if and only if (a) $S = S_\perp$, or (b) S[1] is applicable to C, S[2] is applicable
to S[1](C), etc.⁵ If S is a finite schedule applicable to C, S(C) denotes the unique
configuration that results from applying S to C. Note $S_\perp(C) = C$ for all configurations
C. We say that C' is a configuration of (S, C) if there is a prefix S' of S such that
C' = S'(C).

⁵ We denote by v[i] the ith element of a sequence v.

A partial run of algorithm A using a failure detector D is a tuple $R = (F, H_D, I, S, T)$
where F is a failure pattern, $H_D \in D(F)$ is a failure detector history, I is an initial
configuration of A, S is a finite schedule of A, and T is a finite list of increasing time
values (indicating when each step in S occurred) such that |S| = |T|, S is applicable to
I, and for all $i \leq |S|$, if S[i] is of the form (p, m, d, A) then:

• p has not crashed by time T[i], i.e., $p \notin F(T[i])$

• d is the value of the failure detector module of p at time T[i], i.e., $d = H_D(p, T[i])$

Informally, a partial run of A using D represents a finite point of some execution of A
using D.

A run of an algorithm A using a failure detector D is a tuple $R = (F, H_D, I, S, T)$
where F is a failure pattern, $H_D \in D(F)$ is a failure detector history, I is an initial
configuration of A, S is an infinite schedule of A, and T is an infinite list of increasing
time values indicating when each step in S occurred. In addition to satisfying the above
properties of a partial run, a run must also satisfy the following properties:

• Every correct process takes an infinite number of steps in S. Formally:

$$\forall p \in \mathit{correct}(F),\ \forall i,\ \exists j > i : S[j] \text{ is of the type } (p, -, -, A)$$

• Every message sent to a correct process is eventually received. Formally:

$$\forall p \in \mathit{correct}(F),\ \forall C = (s, M) \text{ of } (S, I) : m = (q, \mathit{data}, p) \in M \Rightarrow (\exists i : S[i] \text{ is of the type } (p, m, -, A))$$

In [CT91], we proved that any algorithm that uses W to solve Consensus requires
n > 2f. With other failure detectors the requirements may be different. For example,
there is a failure detector that can be used to solve Consensus only if $p_1$ and $p_2$ do not both
crash. In general, whether a given failure detector can be used to solve Consensus depends
upon assumptions about the underlying “environment”. Formally, an environment E (of
an asynchronous system) is a set of possible failure patterns.⁶

⁶ In a synchronous system, assumptions about the underlying environment may also include other
characteristics such as the relative process speeds, the maximum message delay, the degree of clock
synchronization, etc. In such a system, a more elaborate definition of an environment would be required.

3 The Consensus problem

In the Consensus problem, each process p has an initial value, 0 or 1, and must reach an
irrevocable decision on one of these values. Thus, the algorithm of process p, A(p), has
two distinct initial states $\sigma_0^p$ and $\sigma_1^p$ signifying that p’s initial value is 0 or 1. A(p) also
has two disjoint sets of decision states $\Sigma_0^p$ and $\Sigma_1^p$. If p enters a state in $\Sigma_k^p$, we require
that it remain in states in $\Sigma_k^p$, and we say that p has decided k.

We say that algorithm A uses failure detector D to solve Consensus in environment
E if every run $R = (F, H_D, I, S, T)$ of A using D where $F \in E$ satisfies:

Termination: Each correct process eventually decides. Formally:

$$\forall p \in \mathit{correct}(F),\ \exists C = (s, M) \text{ of } (S, I) : s(p) \in \Sigma_0^p \cup \Sigma_1^p$$

Validity: Each correct process decides on the initial value of some process. Formally,
let $I = (s_0, M_0)$:

$$\forall p \in \mathit{correct}(F),\ \forall k \in \{0, 1\} : (\exists C = (s, M) \text{ of } (S, I) : s(p) \in \Sigma_k^p) \Rightarrow (\exists q \in \Pi : s_0(q) = \sigma_k^q)$$

Agreement: No two correct processes decide differently. Formally:

$$\forall p, p' \in \mathit{correct}(F),\ \forall C = (s, M) \text{ of } (S, I),\ \forall k, k' \in \{0, 1\} : (s(p) \in \Sigma_k^p \wedge s(p') \in \Sigma_{k'}^{p'}) \Rightarrow k = k'$$
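The three requirements above can be illustrated on the outcome of a single execution. The sketch below assumes the run has been summarized by two maps, the initial value of each process and the value each process decided (if any); this summary format and the function name `check_consensus` are assumptions for illustration and are much coarser than the configuration-based definitions.

```python
from typing import Dict, Iterable

Process = str


def check_consensus(initial: Dict[Process, int],
                    decisions: Dict[Process, int],
                    correct: Iterable[Process]) -> Dict[str, bool]:
    """Evaluate Termination, Validity and Agreement for the correct processes.

    initial[p]   -- p's initial value (0 or 1)
    decisions[p] -- the value p decided; p is absent if it never decided
    """
    correct = list(correct)
    decided = [decisions[p] for p in correct if p in decisions]
    return {
        # every correct process has decided
        "termination": all(p in decisions for p in correct),
        # every decided value is the initial value of some process
        "validity": all(v in initial.values() for v in decided),
        # no two correct processes decided differently
        "agreement": len(set(decided)) <= 1,
    }
```

For example, if all processes start with 0 but two correct processes both decide 1, Agreement holds while Validity fails.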

4 Reducibility

We now define what it means for an algorithm $T_{D \to D'}$ to transform a failure detector
D into another failure detector D' in an environment E. Algorithm $T_{D \to D'}$ uses D to
maintain a variable $\mathit{output}_p$ at every process p. This variable, reflected in the local state
of p, emulates the output of D' at p. Let $O_R$ be the history of all the output variables
in run R, i.e., $O_R(p, t)$ is the value of $\mathit{output}_p$ at time t in run R. Algorithm $T_{D \to D'}$
transforms D into D' in E if and only if for every run $R = (F, H_D, I, S, T)$ of $T_{D \to D'}$
using D, where $F \in E$, $O_R \in D'(F)$.

Given $T_{D \to D'}$, anything that can be done using D' in E can be done using D instead.
To see this, suppose a given algorithm B requires failure detector D' (when it executes in
E), but only D is available. We can still execute B as follows. Concurrently with B, we
run $T_{D \to D'}$ to transform D into D'. We now modify the failure detector query phase of
each step of B at process p: p reads the current value of $\mathit{output}_p$ (which is concurrently
maintained by $T_{D \to D'}$) instead of querying its failure detector module. This is illustrated
in Fig. 1.

Intuitively, since $T_{D \to D'}$ is able to use D to emulate D', D provides at least as much
information about process failures in E as D' does. Thus, if there is an algorithm $T_{D \to D'}$
that transforms D into D' in E, we write $D \geq_E D'$ and say that D' is reducible to D in
E; we also say that D' is weaker than D in E.
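The way algorithm B reads an emulated detector can be sketched as follows. The sketch makes a strong simplifying assumption: the transformation is a purely local rule that maps the value of the available detector at p to the emulated value at p. A general transformation algorithm is a distributed algorithm that may also exchange messages, which is not modelled here. The class `EmulatedDetector` and its method names are hypothetical.

```python
from typing import Any, Callable, Dict

Process = str


class EmulatedDetector:
    """Holds output_p for every process p. The transformation keeps it up to
    date, and algorithm B reads it instead of querying a real module."""

    def __init__(self,
                 query_d: Callable[[Process, int], Any],
                 transform: Callable[[Process, Any], Any]):
        self._query_d = query_d        # (p, t) -> value of the available detector
        self._transform = transform    # assumed local rule: (p, d) -> output_p
        self.output: Dict[Process, Any] = {}

    def transformation_step(self, p: Process, t: int) -> None:
        """One local step of the transformation at p: query the available
        detector at time t and refresh output_p."""
        self.output[p] = self._transform(p, self._query_d(p, t))

    def query(self, p: Process) -> Any:
        """What B's failure detector query phase reads at p: output_p."""
        return self.output[p]
```

The point of the design is that B never touches `query_d` directly; it only sees `output`, so B cannot tell whether it is running against the emulated detector or a real one.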

5  An  outline  of  the  result 

In [CT91] we showed that W can be used to solve Consensus in any environment in which
n > 2f. We now show that W is weaker than any failure detector that can be used to
solve Consensus. This result holds for any environment E. Together with [CT91], this
implies that W is indeed the weakest failure detector that can be used to solve Consensus
in any environment in which n > 2f.

Figure 1: Transforming D into D'


To prove our result, we first define a new failure detector, denoted Ω, that is at least
as strong as W. We then show that any failure detector D that can be used to solve
Consensus is at least as strong as Ω. Thus, D is at least as strong as W.

The output of the failure detector module of Ω at a process p is a single process, q,
that p currently considers to be correct; we say that p trusts q. In this case, $\mathcal{R}_\Omega = \Pi$.
For each failure pattern F, Ω(F) is the set of all failure detector histories $H_\Omega$ with range
$\mathcal{R}_\Omega$ that satisfy the following property:

• There is a time after which all the correct processes always trust the same correct
process:

$$\exists t \in T,\ \exists q \in \mathit{correct}(F),\ \forall p \in \mathit{correct}(F),\ \forall t' > t : H_\Omega(p, t') = q$$

As with W, the output of the failure detector module of Ω at a process p may change
with time, i.e., p may trust different processes at different times. Furthermore, at any
given time t, processes p and q may trust different processes.

Theorem 1: For all environments E, $\Omega \geq_E W$.

Proof: [Sketch] The reduction algorithm $T_{\Omega \to W}$ that transforms Ω into W is as follows.
Each process p periodically sets $\mathit{output}_p \leftarrow \Pi - \{q\}$, where q is the process that p currently
trusts according to Ω. It is easy to see that (in any environment E) this output satisfies
the two properties of W. □
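The reduction in the proof sketch of Theorem 1 is simple enough to run. The sketch below uses a hypothetical three-process system and an illustrative Ω history that stabilizes on process "b" after an assumed time `STABLE_AFTER`; it then checks, over a finite window after stabilization, that the emulated output behaves as the two properties of W require. The process names, the history and the window are all assumptions made for the example.

```python
from typing import FrozenSet

Process = str
PI: FrozenSet[Process] = frozenset({"a", "b", "c"})   # hypothetical system
CORRECT = frozenset({"a", "b"})                        # "c" is the crashed process
CRASHED = PI - CORRECT
STABLE_AFTER = 5     # assumed time after which Omega has stabilized
HORIZON = 20         # finite window used only for this illustration


def h_omega(p: Process, t: int) -> Process:
    """Illustrative Omega history: arbitrary trust up to STABLE_AFTER
    (possibly in the crashed process), then everyone trusts 'b'."""
    if t <= STABLE_AFTER:
        return sorted(PI)[t % len(PI)]
    return "b"


def output(p: Process, t: int) -> FrozenSet[Process]:
    """The reduction: p suspects every process except the one it trusts."""
    return PI - {h_omega(p, t)}


window = range(STABLE_AFTER + 1, HORIZON + 1)

# Property 1 of W: each crashed process is always suspected by some correct process.
prop_1 = all(
    any(all(c in output(q, u) for u in window) for q in CORRECT)
    for c in CRASHED
)

# Property 2 of W: some correct process is never suspected by any correct process.
prop_2 = any(
    all(p not in output(q, u) for q in CORRECT for u in window)
    for p in CORRECT
)
```

Both checks succeed for the same reason: once every correct process trusts the same correct process q, the set Π − {q} contains every crashed process and never contains q.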

Theorem 2: For all environments E, if a failure detector D can be used to solve
Consensus in E, then $D \geq_E \Omega$.

Proof: The reduction algorithm $T_{D \to \Omega}$ is shown in Section 6. It is the core of our result. □

Corollary 3: For all environments E, if a failure detector D can be used to solve
Consensus in E, then $D \geq_E W$.

Proof: If D can be used to solve Consensus in E, then, by Theorem 2, $D \geq_E \Omega$. From
Theorem 1, $\Omega \geq_E W$. By transitivity, $D \geq_E W$. □

In [CT91] we proved that, for all environments E in which n > 2f, W can be used to
solve Consensus. Together with Corollary 3, this shows that:

Theorem 4: For all environments E in which n > 2f, W is the weakest failure detector
that can be used to solve Consensus in E.


6  The  reduction  algorithm 

Let E be an environment, D be a failure detector that can be used to solve Consensus in
E, and $\mathit{Consensus}_D$ be the Consensus algorithm that uses D. We describe an algorithm
$T_{D \to \Omega}$ that transforms D into Ω in E. Intuitively, this algorithm works as follows. Fix
an arbitrary run of $T_{D \to \Omega}$ using D in E, with failure pattern $F \in E$, and failure detector
history $H_D \in D(F)$. We shall first construct an infinite directed acyclic graph, denoted
G, whose vertices are some of the failure detector values that occur in $H_D$, and whose
edges are consistent with the time at which these values occur. We then show that G
induces a simulation forest T that encodes an infinite set of possible runs of $\mathit{Consensus}_D$.
Finally, we show how to extract from T the identity of a process p* that is correct in F.

The induced simulation forest is infinite and thus cannot be computed by any process.
However, the information needed to extract p* is present in a finite subgraph of the
forest. It will be sufficient for each correct process p to construct ever increasing finite
approximations of the simulation forest T that will eventually include this crucial finite
subgraph. At all times, p uses its present approximation of T to select the identity
of some process: once p’s approximation of T includes the crucial finite subgraph, the
selected process will be p* (forever). Thus, there is a time after which all correct processes
trust the same correct process, p*, which is exactly what Ω requires.

We say that a process is correct (crashes) if it is correct (crashes) in F. For simplicity,
we assume that a process p sees a value d at most once (this can be enforced by tagging
a counter to each value seen). For the rest of this paper, whenever we refer to a run
of $\mathit{Consensus}_D$, we mean a run of $\mathit{Consensus}_D$ using D. Furthermore, we only consider
schedules of $\mathit{Consensus}_D$, and therefore we write (p, m, d) instead of (p, m, d, $\mathit{Consensus}_D$)
to denote a step.

6.1 A DAG and a forest

Given the failure pattern F and the corresponding failure detector history
$H_{\mathcal{D}} \in \mathcal{D}(F)$ that were fixed above, let G be any infinite directed
acyclic graph with the following properties:

1. The vertices of G are of the form $[p, d]$ where $p \in \Pi$ and $d \in R_{\mathcal{D}}$. If
$[p, d]$ is a vertex of G, then there is a time t such that $p \notin F(t)$ and
$d = H_{\mathcal{D}}(p, t)$ (i.e., at time t, p has not crashed and the value of p's failure
detector module is d).

2. If $[q_1, d_1] \to [q_2, d_2]$ is an edge of G and $d_1 = H_{\mathcal{D}}(q_1, t_1)$ and
$d_2 = H_{\mathcal{D}}(q_2, t_2)$ then $t_1 < t_2$.

3. G is transitively closed.

4. Let p be any correct process and V be a finite subset of vertices of G. There is a
failure detector value d such that for all vertices $[p', d']$ in V, $[p', d'] \to [p, d]$ is an
edge of G.

Note that such a DAG represents only a “sampling” of the failure detector values that
occur in $H_{\mathcal{D}}$. In particular, we do not require that it contain all the values
that occur in $H_{\mathcal{D}}$ or that it relate (with an edge) all the values according to
the time at which they occur. However, Property 4 implies that the DAG contains infinitely many
“samplings” of the failure detector module of each correct process.
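As an illustration of Properties 2 and 3, the following minimal Python sketch checks them on a
finite sample of such a DAG. It rests on assumptions that are not part of the paper: each vertex
is stored as a triple (process, value, time), so that the time whose existence Property 1 asserts
is carried explicitly, and the function names are hypothetical.

```python
from itertools import product

# Assumption: a vertex is (process, value, time); the paper's vertices are
# [p, d] and only assert that a suitable time exists.

def edges_respect_time(edges):
    """Property 2: every edge goes from an earlier sample to a later one."""
    return all(u[2] < v[2] for u, v in edges)

def is_transitively_closed(edges):
    """Property 3: u -> v and v -> w imply u -> w."""
    edge_set = set(edges)
    return all((u, w) in edge_set
               for (u, v), (v2, w) in product(edge_set, repeat=2)
               if v == v2)

def transitive_closure(edges):
    """Add edges until Property 3 holds (finite graphs only)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (u, v), (v2, w) in product(list(closure), repeat=2):
            if v == v2 and (u, w) not in closure:
                closure.add((u, w))
                changed = True
    return closure
```

Because edge times are strictly increasing, closing a time-consistent edge set under
transitivity keeps it time-consistent, so Properties 2 and 3 do not conflict.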

Lemma  5:  Let  V  be  any  finite  subset  of  vertices  in  G.  G  has  an  infinite  path  g  such 
that: 

•  There  is  an  edge  from  every  vertex  of  V  to  the  first  vertex  of  g. 

•  If $[p, -]$ is a vertex of g then p is correct; for each correct p, there are infinitely
many vertices $[p, -]$ in g.

Proof:  By  repeated  application  of  Property  4.  □ 

Let $g = [q_1, d_1], [q_2, d_2], \ldots$ be any (finite or infinite) path of G. A schedule S is
compatible with g if it has the same length as g, and
$S = (q_1, m_1, d_1), (q_2, m_2, d_2), \ldots$, for some (possibly null) messages
$m_1, m_2, \ldots$ We say that S is compatible with G if it is compatible with some path of G.
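The compatibility test can be sketched in a few lines of Python. This is an illustration under
assumptions, not the paper's construction: G is represented by a finite vertex set and edge set,
vertices are pairs (q, d), steps are triples (q, m, d) with `None` standing for the null message,
and all function names are hypothetical.

```python
def is_path(path, vertices, edges):
    """True if `path` is a sequence of vertices of G joined by edges of G."""
    return (all(v in vertices for v in path)
            and all((u, v) in edges for u, v in zip(path, path[1:])))

def is_compatible_with_path(schedule, path):
    """Same length, and the i-th step uses the process and failure detector
    value of the i-th vertex; the message component is unconstrained."""
    return len(schedule) == len(path) and all(
        (q, d) == vertex for (q, _m, d), vertex in zip(schedule, path))

def is_compatible_with_dag(schedule, vertices, edges):
    """Compatible with G: the vertices read off the schedule form a path."""
    path = [(q, d) for (q, _m, d) in schedule]
    return is_path(path, vertices, edges)
```

Note that compatibility ignores the messages entirely; whether a message can actually be
received is a separate condition (applicability to an initial configuration).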

Let I be any initial configuration of Consensus$_{\mathcal{D}}$. We define the simulation tree
$\Upsilon_G^I$ induced by G and I as follows. The vertices of $\Upsilon_G^I$ are the finite
schedules S that are compatible with G and are applicable to I. The root of $\Upsilon_G^I$ is
the empty schedule $S_\bot$. There is an edge from vertex S to vertex S' if and only if
$S' = S \cdot e$ for a step e;^7 this edge is labeled e. With each (finite or infinite) path in
$\Upsilon_G^I$, we associate the unique schedule $S = e_1, e_2, \ldots, e_k, \ldots$ consisting
of the sequence of labels of the edges on that path. Note that if a path starts from the root
of $\Upsilon_G^I$ and it is finite, the schedule S associated with it is also the last vertex
of that path.

^7 If v, w are sequences and v is finite then $v \cdot w$ denotes the concatenation of the two
sequences.

Lemma 6: S is a schedule associated with a path of $\Upsilon_G^I$ that starts from the root if
and only if S is a schedule compatible with G and applicable to I.

Proof: The lemma obviously holds if S is a finite schedule (this is immediate from the
definitions). Now let $S = e_1, e_2, \ldots, e_i, \ldots$ be an infinite schedule, where
$e_i = (q_i, m_i, d_i)$. We define $S_0 = S_\bot$, $S_1 = e_1$, $S_2 = S_1 \cdot e_2$, and in
general $S_i = S_{i-1} \cdot e_i$ for all $i = 1, 2, \ldots$

Assume that S is compatible with G and applicable to I. We must show that S is a
schedule associated with a path of $\Upsilon_G^I$ that starts from the root. To see this, note
that for all $i \ge 0$, $S_i$ is a finite schedule that is also compatible with G and applicable
to I. Thus, all the schedules $S_0, S_1, S_2, \ldots, S_{i-1}, S_i, \ldots$ are vertices of
$\Upsilon_G^I$. Since $S_i = S_{i-1} \cdot e_i$, the edge from $S_{i-1}$ to $S_i$ is labeled
$e_i$, for all $i \ge 1$. Thus, $S = e_1, e_2, \ldots, e_i, \ldots$ is the schedule associated
with the infinite path $S_0 \to S_1 \to S_2 \to \ldots \to S_{i-1} \to S_i \to \ldots$ of
$\Upsilon_G^I$; this path starts from the root $S_0 = S_\bot$.

Assume that S is a schedule associated with an infinite path of $\Upsilon_G^I$ that starts from
the root. We must show that S is compatible with G and is applicable to I. First
note that for all i, $S_i$ is a vertex in $\Upsilon_G^I$, thus $S_i$ is compatible with G and is
applicable to I. Since $S_i = (q_1, m_1, d_1), (q_2, m_2, d_2), \ldots, (q_i, m_i, d_i)$ is
compatible with G, G must contain the path $\pi_i = [q_1, d_1], [q_2, d_2], \ldots, [q_i, d_i]$
(for all i). Note that, for all i, $\pi_{i+1} = \pi_i \cdot [q_{i+1}, d_{i+1}]$ is an extension
of the path $\pi_i$ in G. Therefore, G contains the infinite path
$[q_1, d_1], [q_2, d_2], \ldots, [q_i, d_i], \ldots$ So S is compatible with G. Furthermore,
since all the $S_i$ are applicable to I, by definition of applicability, the infinite schedule
S is also applicable to I. Thus, S is compatible with G and applicable to I. □

The following two lemmata show that the finite and infinite paths of $\Upsilon_G^I$ correspond
to partial runs and runs of Consensus$_{\mathcal{D}}$ with initial configuration I.

Lemma 7: Let S be a schedule associated with a finite path of $\Upsilon_G^I$ that starts from
the root. There is a sequence of times T such that $(F, H_{\mathcal{D}}, I, S, T)$ is a partial
run of Consensus$_{\mathcal{D}}$.

Proof: By Lemma 6, S is applicable to I and compatible with G. Thus S is compatible
with some finite path $g = [q_1, d_1], [q_2, d_2], \ldots, [q_i, d_i], \ldots, [q_k, d_k]$ of G.
From Property 1 of G (applied to every vertex of the path g), there is a sequence
$T = t_1, t_2, \ldots, t_i, \ldots, t_k$ of times such that for all i, $1 \le i \le k$,
$d_i = H_{\mathcal{D}}(q_i, t_i)$ and $q_i \notin F(t_i)$. From Property 2 of G (applied to every
edge of the path g), for all i, $1 \le i < k$, $t_i < t_{i+1}$. Thus T is a sequence of
increasing times, and, by definition, $(F, H_{\mathcal{D}}, I, S, T)$ is a partial run of
Consensus$_{\mathcal{D}}$. □

Lemma 8: Let S be a schedule associated with an infinite path of $\Upsilon_G^I$ that starts
from the root. If in S every correct process takes an infinite number of steps and every
message sent to a correct process is eventually received, there is a sequence of times T such
that $(F, H_{\mathcal{D}}, I, S, T)$ is a run of Consensus$_{\mathcal{D}}$.

Proof: Similar to Lemma 7. □

The  following  lemmata  show  some  “richness”  properties  of  the  simulation  trees  induced 
by  G. 

Lemma 9: For any two initial configurations I and I', if S is a vertex of $\Upsilon_G^I$ and is
applicable to I' then S is also a vertex of $\Upsilon_G^{I'}$.

Proof:  Follows  directly  from  the  definitions.  □ 

Lemma 10: Let S be any vertex of $\Upsilon_G^I$ and p be any correct process. Let m be a
message in the message buffer of $S(I)$ addressed to p or the null message. For some d,
S has a child $S \cdot (p, m, d)$ in $\Upsilon_G^I$.

Proof: From the definition of $\Upsilon_G^I$, S is compatible with some finite path g of G and
applicable to I. Let v denote the last vertex of g. By Property 4, there is a d such
that $v \to [p, d]$ is an edge of G. Therefore, $g \cdot [p, d]$ is a path of G, and
$S \cdot (p, m, d)$ is compatible with G.

It remains to show that $S \cdot (p, m, d)$ is applicable to I. Since S is applicable to I, it
suffices to show that $(p, m, d)$ is applicable to $S(I)$. But this is true since, by hypothesis,
m is in the message buffer of $S(I)$ and addressed to p, or the null message. □

Lemma 11: Let S be any vertex of $\Upsilon_G^I$ and p be any process. Let m be a message in
the message buffer of $S(I)$ addressed to p or the null message. Let S' be a descendent
of S such that, for some d, $S' \cdot (p, m, d)$ is in $\Upsilon_G^I$. For each vertex S'' on
the path from S to S' (inclusive), $S'' \cdot (p, m, d)$ is also in $\Upsilon_G^I$.

Proof: Since they are vertices of $\Upsilon_G^I$, S, S'' and $S' \cdot (p, m, d)$ are compatible
with some finite paths g, $g \cdot g''$ and $g \cdot g'' \cdot g' \cdot [p, d]$ of G,
respectively. From Property 3 (transitive closure) of G, $g \cdot g'' \cdot [p, d]$ is also a
path of G. So $S'' \cdot (p, m, d)$ is compatible with this path of G. We now show that
$S'' \cdot (p, m, d)$ is also applicable to I, and therefore it is a vertex of $\Upsilon_G^I$.

Since S'' is a vertex of $\Upsilon_G^I$, S'' is applicable to I. If $m = \lambda$, then
$(p, m, d)$ is obviously applicable to $S''(I)$. Now suppose $m \ne \lambda$. Since
$S' \cdot (p, m, d)$ is a vertex of $\Upsilon_G^I$, $(p, m, d)$ is applicable to $S'(I)$, and
thus m is in the message buffer of $S'(I)$. Since each message is sent at most once and m is in
the message buffers of $S(I)$ and $S'(I)$, there is no edge of the type $(p, m, -)$ on the path
from S to S'. So m is also in the message buffer of $S''(I)$, and $(p, m, d)$ is applicable to
$S''(I)$. □

Lemma 12: Let $S$, $S_0$, and $S_1$ be any vertices of $\Upsilon_G^I$. There is a finite
schedule E containing only steps of correct processes such that:

1. $S \cdot E$ is a vertex of $\Upsilon_G^I$ and all correct processes have decided in
$S \cdot E(I)$.

2. For $i = 0, 1$, if E is applicable to $S_i(I)$ then $S_i \cdot E$ is a vertex of
$\Upsilon_G^I$.

    j ← 0
    S^0 ← S    { S^0 is compatible with g and applicable to I }
    repeat forever
        j ← j + 1
        Let [q_j, d_j] be the j-th vertex of path g_∞
        Let m_j be the oldest message addressed to q_j in the message buffer of S^{j-1}(I)
            (if no such message exists, m_j = λ)
        e_j ← (q_j, m_j, d_j)
        S^j ← S^{j-1} · e_j    { S^j is compatible with g · [q_1, d_1] · ... · [q_j, d_j] and applicable to I }

Figure 2: Generating schedule $S \cdot E^\infty$, compatible with path $g \cdot g_\infty$, in
$\Upsilon_G^I$



Proof: Since S is a vertex of $\Upsilon_G^I$, S is compatible with some finite path g of G and
is applicable to I. Similarly, $S_0$ and $S_1$ are compatible with some finite paths $g_0$ and
$g_1$, respectively, of G. From Lemma 5 (applied to the last vertices of $g$, $g_0$ and $g_1$),
G has an infinite path $g_\infty = [q_1, d_1], [q_2, d_2], \ldots, [q_j, d_j], \ldots$ with the
following two properties:

1. There is an edge from the last vertex of $g$, $g_0$ and $g_1$ to the first vertex of
$g_\infty$. (Thus, $g \cdot g_\infty$, $g_0 \cdot g_\infty$, and $g_1 \cdot g_\infty$ are
infinite paths in G.)

2. If $[p, -]$ is a vertex of $g_\infty$ then p is correct; for each correct p, there are
infinitely many vertices $[p, -]$ in $g_\infty$.

We now show how to construct the required schedule E. Consider the infinite sequence
of schedules $S^0, S^1, S^2, \ldots, S^j, \ldots$ constructed by the algorithm in Figure 2. An
easy induction shows that for all $j \ge 0$, $S^j$ is applicable to I and is compatible with
$g \cdot [q_1, d_1] \cdot \ldots \cdot [q_j, d_j]$, a prefix of the path $g \cdot g_\infty$ in G.
So, for all $j \ge 0$, $S^j$ is a vertex of $\Upsilon_G^I$. Consider the infinite path of
$\Upsilon_G^I$ that starts from the root of $\Upsilon_G^I$, then goes to $S^0 = S$, and then to
$S^1, S^2, \ldots, S^j, \ldots$ The infinite schedule associated with that path is
$S^\infty = S \cdot e_1 \cdot e_2 \cdot \ldots \cdot e_j \ldots$ Note that schedule
$E^\infty = e_1 \cdot e_2 \cdot \ldots \cdot e_j \ldots$ is compatible with path $g_\infty$ of G.
By Property (2) of path $g_\infty$, every correct process p takes an infinite number of steps in
$E^\infty$ (and thus also in $S^\infty = S \cdot E^\infty$). Since in each one of these steps p
receives the oldest message that is addressed to it, every message sent to p (in $S^\infty$) is
eventually received. By Lemma 8, there is a T such that
$R = (F, H_{\mathcal{D}}, I, S^\infty, T)$ is a run of Consensus$_{\mathcal{D}}$.

From the termination requirement of Consensus, $S^\infty$ has a finite prefix $S_d$ such that
all correct processes have decided in $S_d(I)$. There are two cases:

•  $S_d$ is a prefix of S. Since decisions are irrevocable, all correct processes remain
decided in $S(I)$. Thus $S_\bot$, the empty schedule, is the required E.

•  S is a prefix of $S_d$. Thus, $S_d = S \cdot E$ where E is a finite prefix of $E^\infty$.
Since $E^\infty$ is compatible with $g_\infty$, E is compatible with a prefix of $g_\infty$. Now
consider $S_0$ (the following argument also applies to $S_1$). Since $S_0$ is compatible with
$g_0$, $S_0 \cdot E$ is compatible with a prefix of $g_0 \cdot g_\infty$, a path in G. So,
$S_0 \cdot E$ is compatible with G. If $S_0 \cdot E$ is also applicable to I, then, by the
definition of $\Upsilon_G^I$, it is a vertex of $\Upsilon_G^I$. The same argument holds for
$S_1$. It remains to show that E contains only steps of correct processes. This is immediate
from Property (2) of $g_\infty$ and from the fact that E is compatible with a prefix of
$g_\infty$. □
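The loop of Figure 2 can be illustrated on a finite prefix of the path with the Python sketch
below. It is a toy under stated assumptions, not the paper's algorithm: each process has a FIFO
buffer, `None` plays the role of the null message $\lambda$, and the hypothetical callback
`on_step` stands in for the (unspecified) transition function of Consensus$_{\mathcal{D}}$ by
returning the messages that a step sends.

```python
from collections import deque

def generate_schedule(path, buffers, on_step):
    """Extend a schedule along `path`, a list of vertices (q, d).

    buffers: dict process -> deque of pending messages (oldest first).
    on_step: hypothetical callback (q, m, d) -> list of (destination, message)
             pairs sent by that step; it models the simulated algorithm.
    """
    schedule = []
    for q, d in path:
        # Deliver the oldest message addressed to q, or the null message.
        m = buffers[q].popleft() if buffers[q] else None
        schedule.append((q, m, d))
        for dest, msg in on_step(q, m, d):
            buffers[dest].append(msg)
    return schedule
```

Delivering the oldest pending message at every step is what makes the resulting infinite
schedule fair: a process that appears infinitely often on the path eventually receives every
message sent to it.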

Let $I^i$, $0 \le i \le n$, denote the initial configuration of Consensus$_{\mathcal{D}}$ in
which the initial values of $p_1 \ldots p_i$ are 1, and the initial values of
$p_{i+1} \ldots p_n$ are 0. The simulation forest induced by G is the set
$\{\Upsilon_G^{I^0}, \Upsilon_G^{I^1}, \ldots, \Upsilon_G^{I^n}\}$ of simulation trees induced by
G and initial configurations $I^0, I^1, \ldots, I^n$.
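To make the family of initial configurations concrete, here is a small Python sketch. It models
only the vector of initial values (the full configuration also includes process states and an
empty message buffer), and the function names are hypothetical.

```python
def initial_values(i, n):
    """Initial values of p_1..p_n in I^i: p_1..p_i propose 1, the rest 0."""
    if not 0 <= i <= n:
        raise ValueError("i must satisfy 0 <= i <= n")
    return [1] * i + [0] * (n - i)

def differing_processes(i, n):
    """1-based indices of processes whose value differs between I^(i-1) and I^i."""
    a, b = initial_values(i - 1, n), initial_values(i, n)
    return [k + 1 for k in range(n) if a[k] != b[k]]
```

Consecutive configurations $I^{i-1}$ and $I^i$ differ only in the initial value of $p_i$, which
is the fact that the argument about monovalent critical indices relies on.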

6.2  Tagging the simulation forest

We assign a set of tags to each vertex of every tree in the simulation forest induced by
G. Vertex S of tree $\Upsilon_G^{I^i}$ gets tag k if and only if it has a descendent S' such
that some correct process has decided k in $S'(I^i)$. Hereafter, $\Upsilon^i$ denotes the tagged
tree $\Upsilon_G^{I^i}$, and $\Upsilon$ denotes the tagged simulation forest
$\{\Upsilon^0, \Upsilon^1, \ldots, \Upsilon^n\}$.
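On a finite portion of a tree, the tagging rule amounts to a bottom-up union, as in the
following Python sketch. The representation is an assumption made for illustration: `children`
maps a vertex to its children, `decided` maps a vertex S to the set of values decided by
correct processes in the corresponding configuration, and a vertex is treated as a descendent
of itself.

```python
def tag_tree(children, decided, root):
    """Return dict vertex -> set of tags for a finite tree.

    A vertex gets tag k if k is decided at the vertex itself or at any of
    its descendents (assumption: a vertex counts as its own descendent).
    """
    tags = {}

    def visit(v):
        t = set(decided.get(v, ()))
        for c in children.get(v, ()):
            t |= visit(c)
        tags[v] = t
        return t

    visit(root)
    return tags
```

On a finite approximation the computed tag sets can only be subsets of the true ones: exploring
more of the tree may add tags to a vertex but never removes any.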

Lemma 13: Every vertex of $\Upsilon^i$ has at least one tag.

Proof: From Lemma 12, every vertex S of $\Upsilon^i$ has a descendent $S' = S \cdot E$ (for
some E) such that all correct processes have decided in $S'(I^i)$. □

A vertex of $\Upsilon^i$ is monovalent if it has only one tag, and bivalent if it has both tags,
0 and 1. A vertex is 0-valent if it is monovalent and is tagged 0; 1-valent is similarly defined.

Lemma 14: Every vertex of $\Upsilon^i$ is either 0-valent, 1-valent, or bivalent.

Proof:  Immediate  from  Lemma  13.  □ 

Lemma 15: The ancestors of a bivalent vertex are bivalent. The descendents of a k-valent
vertex are k-valent.

Proof:  Immediate  from  the  definitions.  □ 

Lemma 16: If vertex S of $\Upsilon^i$ has tag k, then no correct process has decided $1 - k$ in
$S(I^i)$.

Proof: Since S has tag k, it has a descendent S' such that a correct process p has
decided k in $S'(I^i)$. From Lemma 7, there is a T such that
$R = (F, H_{\mathcal{D}}, I^i, S', T)$ is a partial run of Consensus$_{\mathcal{D}}$. Since p has
decided k in $S'(I^i)$, from the agreement requirement of Consensus, no correct process has
decided $1 - k$ in $S'(I^i)$. Since S' is a descendent of S, no correct process could have
decided $1 - k$ in $S(I^i)$. □

Lemma 17: If vertex S of $\Upsilon^i$ is bivalent, then no correct process has decided in
$S(I^i)$.

Proof: Immediate from Lemma 16. □

Recall that in $I^0$ all processes have initial value 0, while in $I^n$ they all have initial
value 1.

Lemma 18: The root of $\Upsilon^0$ is 0-valent; the root of $\Upsilon^n$ is 1-valent.

Proof: We first show that the root of $\Upsilon^0$ is 0-valent. Suppose, for contradiction,
that the root of $\Upsilon^0$ has tag 1. There must be a vertex S of $\Upsilon^0$ such that some
correct process has decided 1 in $S(I^0)$. From Lemma 7, there is a T such that
$R = (F, H_{\mathcal{D}}, I^0, S, T)$ is a partial run of Consensus$_{\mathcal{D}}$. R violates
the validity requirement of Consensus, a contradiction. Thus the root of $\Upsilon^0$ cannot
have a tag of 1. From Lemma 13, the root of $\Upsilon^0$ has at least one tag: thus it is
0-valent.

By a symmetric argument, the root of $\Upsilon^n$ is 1-valent. □

Index i is critical if the root of $\Upsilon^i$ is bivalent, or if the root of
$\Upsilon^{i-1}$ is 0-valent while the root of $\Upsilon^i$ is 1-valent. In the first case, we
say that index i is bivalent critical; in the second case, we say that i is monovalent critical.

Lemma 19: There is a critical index i, $0 < i \le n$.

Proof:  Apply  Lemmata  14  and  18  to  the  roots  of  T°,  T1, . . . ,  Tn.  □ 
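The search behind Lemma 19 can be sketched in Python as a left-to-right scan over the tag sets
of the roots. The input format (a list of tag sets, one per tree) and the function names are
assumptions made for this illustration.

```python
def valence(tags):
    """Classify a non-empty tag set as in the definitions above."""
    if tags == {0, 1}:
        return "bivalent"
    if tags == {0}:
        return "0-valent"
    if tags == {1}:
        return "1-valent"
    raise ValueError("a vertex must carry at least one tag (Lemma 13)")

def smallest_critical_index(root_tags):
    """root_tags[i] is the tag set of the root of tree i, 0 <= i <= n."""
    vals = [valence(t) for t in root_tags]
    for i, v in enumerate(vals):
        if v == "bivalent":
            return i, "bivalent critical"
        if i > 0 and vals[i - 1] == "0-valent" and v == "1-valent":
            return i, "monovalent critical"
    raise ValueError("no critical index: first root must be 0-valent, last 1-valent")
```

When the first root is 0-valent and the last is 1-valent, the scan always succeeds: the first
root that is not 0-valent is either bivalent, or 1-valent with a 0-valent predecessor.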

The critical index i is the key to extracting the identity of a correct process. In fact, if i
is monovalent critical, we shall prove that $p_i$ must be correct (Lemma 21). If i is bivalent
critical, the correct process will be found by focusing on the tree $\Upsilon^i$, as explained
in the following section.

6.3  Of  hooks  and  forks 

We describe two types of finite subtrees of $\Upsilon^i$ referred to as decision gadgets of
$\Upsilon^i$. Each type of decision gadget is rooted at the root $S_\bot$ of $\Upsilon^i$ and
has exactly two leaves: one 0-valent and one 1-valent. The least common ancestor of these leaves
is called the pivot. The pivot is clearly bivalent.

The first type of decision gadget is called a fork, and is shown in Figure 3. The two
leaves are children of the pivot, obtained by applying different steps of the same process
p. Process p is the deciding process of the fork, because its step after the pivot determines
the decision of correct processes.

[Figure 3: A fork; p is the deciding process. The figure shows the root $S_\bot$, the pivot S,
and the two leaves $S \cdot (p, m, d)$ and $S \cdot (p, m', d')$.]

[Figure 4: A hook; p is the deciding process.]

    S ← S_⊥    { S_⊥ is the bivalent root of Υ^i }
    repeat forever
        Let p be the next correct process in round-robin order
        Let m be the oldest message addressed to p in the message buffer of S(I^i)
            (if no such message exists, m = λ)
        if S has a descendent S' such that, for some d, S' · (p, m, d) is a bivalent vertex in Υ^i
            then S ← S' · (p, m, d)    { S is bivalent }
            else exit

Figure 5: Generating path $\pi$ in $\Upsilon^i$

The second type of decision gadget is called a hook, and is shown in Figure 4. Let S
be the pivot of the hook. There is a step e such that $S \cdot e$ is one leaf, and the other
leaf is $S \cdot (p, m, d) \cdot e$ for some p, m, d. Process p is the deciding process of the
hook, because the decision of correct processes is determined by whether p takes the step
$(p, m, d)$ before e.
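The two gadget shapes can be recognised mechanically in a finite tagged tree. The Python sketch
below is an illustration under assumptions: a vertex is represented by its schedule (a tuple of
steps (p, m, d)), `valence` maps each known vertex to "0-valent", "1-valent" or "bivalent", and
a gadget is reported by its pivot and deciding process rather than as a full subtree rooted at
$S_\bot$. The function name is hypothetical.

```python
def find_decision_gadgets(valence):
    """Return (kind, pivot, deciding_process) triples found in a finite tree."""
    mono = {"0-valent", "1-valent"}
    gadgets = []
    for pivot, v in valence.items():
        if v != "bivalent":
            continue
        # Children of the pivot: schedules extending it by exactly one step.
        kids = [s for s in valence
                if len(s) == len(pivot) + 1 and s[:len(pivot)] == pivot]
        # Fork: two different steps of the same process lead to children
        # of opposite valence.
        for i, a in enumerate(kids):
            for b in kids[i + 1:]:
                if a[-1][0] == b[-1][0] and {valence[a], valence[b]} == mono:
                    gadgets.append(("fork", pivot, a[-1][0]))
        # Hook: leaves pivot·e and pivot·(p, m, d)·e have opposite valence.
        for leaf0 in kids:
            e = leaf0[-1]
            for mid in kids:
                leaf1 = mid + (e,)
                if leaf1 in valence and {valence[leaf0], valence[leaf1]} == mono:
                    gadgets.append(("hook", pivot, mid[-1][0]))
    return gadgets
```

For example, with steps `e = ("q", None, "d1")` and `f = ("p", None, "d2")`, a tree in which the
root is bivalent, `(e,)` is 0-valent and `(f, e)` is 1-valent contains a hook whose pivot is the
root and whose deciding process is p.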

We  shall  prove  that  the  deciding  process  p  of  a  decision  gadget  must  be  correct 
(Lemma  23).  Intuitively,  this  is  because  if  p  crashes  no  process  can  figure  out  whether 
p  has  taken  the  step  that  determines  the  decision  value.  The  existence  of  such  a  critical 
“hidden”  step  is  also  at  the  core  of  many  impossibility  proofs  starting  with  [FLP85]. 
In our case, the “hiding” is more difficult because now processes have recourse to the
failure detector $\mathcal{D}$. Despite this, the hiding of the step of the deciding process of
a decision gadget is still possible.

Lemma 20: If index i is bivalent critical then $\Upsilon^i$ has at least one decision gadget
(and hence a deciding process).

Proof: Starting from the bivalent root of $\Upsilon^i$, we generate a path $\pi$ in
$\Upsilon^i$, all the vertices of which are bivalent, as follows. We consider all correct
processes in round-robin fashion. Suppose we have generated path S so far, and it is the turn of
process p. Let m be the oldest message destined to p that is in the message buffer of
$S(I^i)$.^8 (If no such message exists, we take m to be the null message.) We try to extend the
path S so that the last edge in the extension corresponds to p receiving m and the target of
that edge is a bivalent vertex. The path construction ends if and when such an extension is
no longer possible. This construction is shown in Figure 5. Each iteration of the loop
extends the path by at least one edge. Let $\pi$ be the path generated by these iterations;
$\pi$ is finite or infinite depending on whether the loop terminates.

Claim 1: $\pi$ is finite.

Proof: Suppose, for contradiction, that $\pi$ is infinite. Let S be the schedule associated
with $\pi$. By construction, in S every correct process takes an infinite number of steps and
every message sent to a correct process is eventually received. By Lemma 8, there is a
T such that $R = (F, H_{\mathcal{D}}, I^i, S, T)$ is a run of Consensus$_{\mathcal{D}}$. By
construction, all vertices in $\pi$ are bivalent. By Lemma 17, no correct process decides in R,
thus violating the termination requirement of Consensus, a contradiction. □ (Claim 1)

^8 By a slight abuse of notation we identify a finite path from the root and its associated
schedule.

[Figure 6: The decision gadgets in $\Upsilon^i$ if i is bivalent critical]

Let S be the last vertex of $\pi$ (clearly, S is bivalent). Let p be the next correct process
in round-robin order when the loop in Figure 5 terminates. Let m be the oldest message
addressed to p in the message buffer of $S(I^i)$ (if no such message exists, m is the null
message). The loop exit condition and Lemma 14 imply that

All descendents $S' \cdot (p, m, -)$ of S are monovalent. (*)

From Lemma 10, for some d, S has a child $S \cdot (p, m, d)$ in $\Upsilon^i$. By (*),
$S \cdot (p, m, d)$ is monovalent. Without loss of generality, assume it is 0-valent.

Claim 2: For some d' there is a descendent S' of S such that $S' \cdot (p, m, d')$ is a 1-valent
vertex of $\Upsilon^i$, and the path from S to S' contains no edge labeled $(p, m, -)$.

Proof: Since S is bivalent, it has a descendent $S^*$ such that some correct process has
decided 1 in $S^*(I^i)$. From Lemmata 13 and 16, $S^*$ is 1-valent. There are two cases:

1. The path from S to $S^*$ does not have an edge labeled $(p, m, -)$. Suppose
$m \ne \lambda$. Since m is in the message buffer of $S(I^i)$ and p does not receive m in the
path from S to $S^*$, m is still in the message buffer of $S^*(I^i)$. (If $m = \lambda$, no such
argument is needed, since Lemma 10 also covers the null message.) From Lemma 10, for some
d', $S^* \cdot (p, m, d')$ is in $\Upsilon^i$. Since $S^*$ is 1-valent, by Lemma 15,
$S^* \cdot (p, m, d')$ is also 1-valent. In this case, the required S' is $S^*$.

2. The path from S to $S^*$ has an edge labeled $(p, m, -)$. Let $(p, m, d')$ be the first
such edge on that path. Let S' be the source of this edge. By (*), $S' \cdot (p, m, d')$
is monovalent. Since $S' \cdot (p, m, d')$ has a 1-valent descendent $S^*$, by Lemma 15,
$S' \cdot (p, m, d')$ is 1-valent. □ (Claim 2)

Consider the vertex S' and edge $(p, m, d')$ of Claim 2. By Lemma 11, for each vertex S''
on the path from S to S' (inclusive), $S'' \cdot (p, m, d')$ is also in $\Upsilon^i$. By (*),
all such vertices $S'' \cdot (p, m, d')$ are monovalent. In particular, $S \cdot (p, m, d')$ is
monovalent. There are two cases (see Figure 6):

1. $S \cdot (p, m, d')$ is 1-valent. Since $S \cdot (p, m, d)$ is 0-valent, $\Upsilon^i$ has a
fork with pivot S.

2. $S \cdot (p, m, d')$ is 0-valent. Recall that $S' \cdot (p, m, d')$ is 1-valent and for each
vertex S'' between S and S', $S'' \cdot (p, m, d')$ is monovalent. Thus, the path from S to S'
must have two vertices $S_0$ and $S_1$ such that $S_0$ is the parent of $S_1$,
$S_0 \cdot (p, m, d')$ is 0-valent and $S_1 \cdot (p, m, d')$ is 1-valent. Hence, $\Upsilon^i$
has a hook with pivot $S_0$. □

6.4  Extracting  the  correct  process 

By  Lemma  19,  there  is  a  critical  index  i.  If  i  is  monovalent  critical,  Lemma  21  below 
shows  how  to  extract  a  correct  process.  If  i  is  bivalent  critical,  a  correct  process  can  be 
found  by  applying  Lemmata  20  and  23. 

Lemma 21: If index i is monovalent critical then $p_i$ is correct.

Proof: Suppose, for contradiction, that $p_i$ crashes. By Lemma 12(1) (applied to the
root $S = S_\bot$ of $\Upsilon^i$), there is a finite schedule E that contains only steps of
correct processes (and hence no step of $p_i$) such that all correct processes have decided in
$E(I^i)$. Since index i is monovalent critical, the root $S_\bot$ of $\Upsilon^i$ is 1-valent.
Hence all correct processes must have decided 1 in $E(I^i)$.

$I^i$ and $I^{i-1}$ only differ in the state of $p_i$. Since E is applicable to $I^i$ and does
not contain any steps of $p_i$, an easy induction on the number of steps in E shows that: (a) E
is also applicable to $I^{i-1}$, and (b) the state of every process other than $p_i$ is the same
in $E(I^i)$ and $E(I^{i-1})$. Using Lemma 9, (a) implies that E is also a vertex of
$\Upsilon^{i-1}$. By (b), all correct processes have decided 1 in $E(I^{i-1})$. Thus the root of
$\Upsilon^{i-1}$ has tag 1. Since i is monovalent critical, the root of $\Upsilon^{i-1}$ is
0-valent, a contradiction. □

[Figure 7: Lemma 22]

Lemma 22: Let S be any bivalent vertex of $\Upsilon^i$, and $S_0$, $S_1$ be any 0-valent and
1-valent descendents of S. If the paths from S to $S_0$ and from S to $S_1$ contain only steps
of the form $(p, -, -)$, then p is correct.

Proof: Suppose, for contradiction, that p crashes. From Lemma 12, there is a schedule
E containing only steps of correct processes (and hence no step of p) such that:

1. $S \cdot E$ is a vertex of $\Upsilon^i$ and all correct processes have decided in
$S \cdot E(I^i)$.

2. For $k = 0, 1$, if E is applicable to $S_k(I^i)$ then $S_k \cdot E$ is a vertex of
$\Upsilon^i$.

Without loss of generality assume that all correct processes decided 0 in $S \cdot E(I^i)$.
Refer to Figure 7. Since all steps in the path from S to $S_1$ are steps of p, the state of
every process other than p is the same in $S(I^i)$ and in $S_1(I^i)$. Furthermore, any message
addressed to a process other than p that is in the message buffer in $S(I^i)$ is still in the
message buffer in $S_1(I^i)$. Since E is applicable to $S(I^i)$ and does not contain any steps
of p, an easy induction on the number of steps in E shows that: (a) E is also applicable
to $S_1(I^i)$, and (b) the state of every process other than p is the same in $S \cdot E(I^i)$
and $S_1 \cdot E(I^i)$. By (2), (a) implies that $S_1 \cdot E$ is a vertex in $\Upsilon^i$. By
(b), all correct processes decide 0 in $S_1 \cdot E(I^i)$. Thus $S_1$ has tag 0. But $S_1$ is
1-valent, a contradiction. □

Lemma 23: The deciding process of a decision gadget is correct.

Proof: Let $\gamma$ be any decision gadget of $\Upsilon^i$. There are two cases to consider:

1. $\gamma$ is a fork. By Lemma 22, the deciding process of $\gamma$ is correct.

2. $\gamma$ is a hook. Assume (without loss of generality) that S is the pivot of $\gamma$,
$S_0 = S \cdot (p', m', d')$ is the 0-valent leaf of $\gamma$ and
$S_1 = S \cdot (p, m, d) \cdot (p', m', d')$ is the 1-valent leaf of $\gamma$ (see Figure 8).
There are two cases:

(a) $p = p'$. By Lemma 22, p is correct.

(b) $p \ne p'$. Suppose, for contradiction, that p crashes. By Lemma 12, there is a
schedule E containing only steps of correct processes (and hence no step of
p) such that:

i. $S_0 \cdot E$ is a vertex of $\Upsilon^i$ and all correct processes have decided in
$S_0 \cdot E(I^i)$. Since $S_0$ is 0-valent, all correct processes must have decided 0 in
$S_0 \cdot E(I^i)$.

ii. If E is applicable to $S_1(I^i)$ then $S_1 \cdot E$ is a vertex of $\Upsilon^i$.

Let $S' = S \cdot (p, m, d)$ be the parent of $S_1$. The state of every process other than
p is the same in $S(I^i)$ and $S'(I^i)$. Furthermore, any message addressed to a
process other than p that is in the message buffer in $S(I^i)$ is still in the message
buffer in $S'(I^i)$. Therefore, since $S_0 = S \cdot (p', m', d')$ and
$S_1 = S' \cdot (p', m', d')$, the state of every process other than p is the same in
$S_0(I^i)$ and $S_1(I^i)$. In addition, any message addressed to a process other than p that is
in the message buffer in $S_0(I^i)$ is also in the message buffer in $S_1(I^i)$. Since E is
applicable to $S_0(I^i)$ and does not contain any steps of p, an easy induction
on the number of steps in E shows that: (a) E is also applicable to $S_1(I^i)$,
and (b) the state of every process other than p is the same in $S_0 \cdot E(I^i)$ and
$S_1 \cdot E(I^i)$. By (ii), (a) implies that $S_1 \cdot E$ is a vertex of $\Upsilon^i$. By (b),
all correct processes decide 0 in $S_1 \cdot E(I^i)$. Thus $S_1$ receives a tag of 0. But $S_1$
is 1-valent, a contradiction. □

[Figure 8: Lemma 23. The figure shows the hook with $S' = S \cdot (p, m, d)$ and
$S_1 = S' \cdot (p', m', d')$ (1-valent).]

    { Build and tag simulation forest Υ induced by G }
    for i ← 0, 1, ..., n:
        Υ^i ← simulation tree induced by G and I^i
        for every vertex S of Υ^i
            if S has a descendent S' such that a correct process has decided k in S'(I^i)
                then add tag k to S

    { Select a process from tagged simulation forest Υ }
    i ← smallest critical index                                                 (1)
    if i is monovalent critical then return p_i                                 (2)
    else return deciding process of the smallest decision gadget in Υ^i        (3)

Figure 9: Selecting a correct process

There may be several critical indices and several decision gadgets in the simulation forest.
Thus, Lemmata 21 and 23 may identify many correct processes. Our selection rule
chooses one of these, as the failure detector $\Omega$ requires, as follows. It first determines
the smallest critical index $i$. If $i$ is monovalent critical, it selects $p_i$. If, on the other
hand, $i$ is bivalent critical, it chooses the “smallest” decision gadget in $\Upsilon^i$ according to
some encoding of gadgets, and selects the corresponding deciding process. It is easy to
encode finite graphs as natural numbers. Since a decision gadget is just a finite graph,
the selection rule can use any such encoding. The whole method of selecting a correct
process is shown in Figure 9.
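
The selection rule only needs some fixed injective encoding of finite graphs as natural numbers. The following is a minimal sketch of one such encoding, not the one used by the authors (the text leaves the choice open). It assumes a gadget is given as a finite list of edges over integer vertex labels and ignores tags; the names `cantor_pair`, `encode_graph` and `smallest_gadget` are hypothetical.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]


def cantor_pair(a: int, b: int) -> int:
    """Cantor pairing: a bijection from pairs of naturals to naturals."""
    return (a + b) * (a + b + 1) // 2 + b


def encode_graph(edges: Iterable[Edge]) -> int:
    """Injectively encode a finite edge set as a natural number.

    Each edge becomes one number; the sorted edge codes are then folded
    into a single integer. Adding 1 keeps every non-empty prefix nonzero,
    so the fold can be undone step by step and the map is injective.
    """
    code = 0
    for e in sorted({cantor_pair(u, v) for u, v in edges}):
        code = cantor_pair(code, e) + 1
    return code


def smallest_gadget(gadgets: List[Tuple[List[Edge], int]]) -> Tuple[List[Edge], int]:
    """Return the (edges, deciding_process) pair with the smallest encoding."""
    return min(gadgets, key=lambda g: encode_graph(g[0]))


# Two toy "gadgets", each paired with the index of its deciding process.
gadgets = [([(0, 1), (1, 2)], 5), ([(0, 1)], 7)]
print(smallest_gadget(gadgets)[1])  # deciding process of the smallest gadget
```

Because the encoding is injective, "smallest" is well defined and every process that sees the same set of gadgets picks the same one, which is all the rule requires.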

Theorem  24:  The  algorithm  in  Figure  9  selects  a  correct  process. 

Proof: By Lemma 19, there is a critical index $i$, $0 < i \le n$. If $i$ is monovalent critical,
Line 2 returns $p_i$, which, by Lemma 21, is correct. If $i$ is bivalent critical, by Lemma 20,
$\Upsilon^i$ contains at least one decision gadget. Let $\gamma$ be the decision gadget in $\Upsilon^i$ with the
smallest encoding. By Lemma 23, the deciding process of $\gamma$ is correct in $F$. Thus, Line
3 returns the identity of a process that is correct. □




6.5  The  reduction  algorithm 

The selection of a correct process described in Figure 9 is not yet the distributed algorithm
$T_{\mathcal{D} \to \Omega}$ that we are seeking: it involves an infinite simulation forest and is “centralized”.
To turn it into a distributed algorithm, we will modify it as follows. Each process
will cooperate with other processes to construct ever increasing finite approximations of
the simulation forest. Such approximations will eventually contain the decision gadget
and the other tagging information necessary to extract the identity of the same correct
process chosen by the selection method in Figure 9.

Note that the selection method in Figure 9 involves three stages: the construction of
$G$, a graph representing samples of failure detector values and their temporal relationship;
the construction and tagging of the simulation forest induced by $G$; and, finally, the
selection of a correct process using this forest.

Algorithm $T_{\mathcal{D} \to \Omega}$ consists of two components. In the first component, each process
repeatedly queries its failure detector module and sends the failure detector values it
sees to the other processes. This component enables correct processes to construct ever
increasing finite approximations of the same $G$. Since all inter-process communication
occurs in this component, we call it the communication component of $T_{\mathcal{D} \to \Omega}$.

In the second component, each process repeatedly (a) constructs and tags the simulation
forest induced by its current approximation of $G$, and (b) selects the identity of
a process using its current simulation forest. Since this component does not require any
communication, we call it the computation component of $T_{\mathcal{D} \to \Omega}$.

6.5.1  The  communication  component 

In this component processes cooperate to construct ever increasing approximations of
the same $G$. Let $G_p$ denote $p$'s current approximation of $G$. Roughly speaking, each
process $p$ periodically performs the following two tasks: (i) If $p$ receives $G_q$ for some $q$, it
incorporates this information by replacing $G_p$ with the union of $G_p$ and $G_q$. (ii) Process
$p$ queries its own failure detector module. Let $d$ be the value that it sees and $[p', d']$ be
any vertex currently in $G_p$. Clearly, $p$ saw $d$ after $p'$ saw $d'$. Thus $p$ adds $[p, d]$ to $G_p$,
with edges from all other vertices of $G_p$ to $[p, d]$. Process $p$ then sends its updated $G_p$ to
all other processes. The communication component of $T_{\mathcal{D} \to \Omega}$ for $p$ is shown in Figure 10.

Let $G_p(t)$ denote the value of $G_p$ at time $t$. If $p$ takes a step at time $t$, $G_p(t)$ denotes
the value of $G_p$ at the end of that step. The next two lemmata establish certain useful
properties of the graphs constructed by the communication component. In reading the
proofs of these results it will be useful to keep in mind that in our model the three phases
of a step (receive, failure detection query, and send) occur atomically at a single time
$t$.⁹

⁹ As mentioned in footnote 4, our results would be valid in a model where process steps have a finer
granularity. In such a model the proofs of Lemmata 25 and 26 below would be the same in essence,
although some of the details would be different to account for the fact that the three phases of what is
now considered an atomic step would not necessarily take place at the same time.

{Build the directed acyclic graph $G_p$}
$G_p \leftarrow$ empty graph
repeat forever
    Receive phase:
        $p$ receives $m$
    Failure detector query phase:
        $d_p \leftarrow$ query failure detector $\mathcal{D}$
    Send phase:
        if $m$ is of the form $(q, G_q, p)$ then $G_p \leftarrow G_p \cup G_q$    (1)
        add $[p, d_p]$ to $G_p$ and edges from all other vertices of $G_p$ to $[p, d_p]$    (2)
        $output_p \leftarrow$ computation component {Figure 11}    (3)
        $p$ sends $(p, G_p, q)$ to all $q \in \Pi$    (4)

Figure 10: Process $p$'s communication component
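
As an illustration only, the graph-building part of Figure 10 can be sketched in Python as below. The class and method names (`LocalGraph`, `Process`, `step`) are hypothetical. The sketch also adds a per-process query counter to each vertex so that repeated detector values yield distinct vertices, which is an assumption of the sketch rather than something stated in the figure. Line (3), the call to the computation component, is omitted.

```python
from dataclasses import dataclass, field
from typing import Hashable, Optional, Set, Tuple

Vertex = Tuple[str, Hashable, int]   # (process, detector value, query number)


@dataclass
class LocalGraph:
    vertices: Set[Vertex] = field(default_factory=set)
    edges: Set[Tuple[Vertex, Vertex]] = field(default_factory=set)

    def copy(self) -> "LocalGraph":
        return LocalGraph(set(self.vertices), set(self.edges))


class Process:
    def __init__(self, pid: str) -> None:
        self.pid = pid
        self.graph = LocalGraph()
        self.queries = 0  # sketch-only counter that keeps vertices distinct

    def step(self, received: Optional[LocalGraph], fd_value: Hashable) -> LocalGraph:
        """One atomic step; returns the graph that p sends to every process."""
        if received is not None:                      # Line (1): G_p <- G_p U G_q
            self.graph.vertices |= received.vertices
            self.graph.edges |= received.edges
        self.queries += 1
        v = (self.pid, fd_value, self.queries)
        # Line (2): edges from all vertices already in G_p to the new vertex.
        self.graph.edges |= {(u, v) for u in self.graph.vertices}
        self.graph.vertices.add(v)
        return self.graph.copy()                      # Line (4)


p, q = Process("p"), Process("q")
from_p = p.step(None, "d1")        # p queries, sees d1, sends G_p
from_q = q.step(from_p, "d2")      # q merges G_p, then adds [q, d2]
```

After these two steps, $q$'s graph contains the edge from $p$'s sample to $q$'s sample, reflecting that $q$ saw its value after $p$ saw its own.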


Lemma 25: Let $v$ be a vertex contained in some local graph during the execution of
the communication component. Let $G_p(t)$ be the first graph that contains $v$. (That is,
$v$ is in $G_p(t)$, but not in $G_q(t')$ for any process $q$ and time $t' < t$.) Then

1. $v = [p, d]$, and $p$ saw $d$ at time $t$.

2. If $u \to v$ is an edge contained in some local graph during the execution of the
communication component then $u \to v$ is contained in $G_p(t)$.

3. $G_p(t)$ is a subgraph of any graph that contains $v$.

Proof: 1. Process $p$ adds $v$ into $G_p(t)$ in Line (1) or (2). In the latter case, the result
follows immediately. In the former case, $p$ must have received a message at time $t$ with
a graph that contains $v$. The process that sent that message must have therefore had $v$
in its graph before time $t$, contradicting the choice of $G_p(t)$ as the first graph to contain
$v$.

2. Consider the earliest time $t'$ when the edge $u \to v$ was added to some graph, say of
process $q$. By definition of $t$, $t' \ge t$. If $t' > t$, at time $t'$ process $q$ receives a message that
contains a graph with the edge $u \to v$. The sender of that message had a graph that
contained the edge $u \to v$ at some time before $t'$, contrary to the choice of $t'$. Therefore
it must be that $t' = t$. Then, by Part (1), $q = p$ and so $u \to v$ is in $G_p(t)$, as wanted.

3. Suppose, for contradiction, that some graph contains $v$ but is not a supergraph of
$G_p(t)$. Choose the first such graph, say, $G_q(t')$. By definition of $t$, $t' \ge t$. Clearly, $q \ne p$
because $p$ never removes any vertices or edges from its own graph. Therefore, at time
$t'$ process $q$ receives a message with a graph that contains $v$ but is not a supergraph of
$G_p(t)$. The sender of that message must have had a graph that contains $v$ but is not a
supergraph of $G_p(t)$ before time $t'$, contrary to the choice of $G_q(t')$. □

Recall that we are considering a fixed run of $T_{\mathcal{D} \to \Omega}$, with failure pattern $F$, and failure
detector history $H_{\mathcal{D}} \in \mathcal{D}(F)$. We now prove that the graphs constructed by the
communication component of $T_{\mathcal{D} \to \Omega}$ satisfy certain properties. The reader should note
the similarity between the first four of these properties and the four properties of the
graphs defined in Section 6.1.

Lemma 26: For any correct process $p$ and time $t$:

1. The vertices of $G_p(t)$ are of the form $[p', d']$ where $p' \in \Pi$ and $d' \in \mathcal{R}_{\mathcal{D}}$. If $[p', d']$ is
a vertex of $G_p(t)$, then there is a time $t'$ such that $p' \notin F(t')$ and $d' = H_{\mathcal{D}}(p', t')$.

2. If $[q_1, d_1] \to [q_2, d_2]$ is an edge of $G_p(t)$ and $d_1 = H_{\mathcal{D}}(q_1, t_1)$ and $d_2 = H_{\mathcal{D}}(q_2, t_2)$
then $t_1 < t_2$.

3. $G_p(t)$ is transitively closed.

4. There is a time $t' > t$ and a failure detector value $d$ such that for all vertices $[p', d']$
of $G_p(t)$, $[p', d'] \to [p, d]$ is an edge of $G_p(t')$.

5. $G_p(t)$ is a subgraph of $G_p(t')$, for all $t' > t$.

6. For all correct $q$, there is a time $t' > t$ such that $G_p(t)$ is a subgraph of $G_q(t')$.

Proof:

Property 1: Consider the first graph that contains the vertex $[p', d']$. By Lemma 25(1),
this graph is $G_{p'}(t')$ for some time $t'$, and $p'$ saw $d'$ at time $t'$. This means that
$p' \notin F(t')$ (otherwise $p'$ would not have taken a step at time $t'$ and would not have
seen $d'$), and $d' = H_{\mathcal{D}}(p', t')$, as wanted.

Property 2: By Lemma 25(2), $[q_1, d_1] \to [q_2, d_2]$ is an edge of $G_{q_2}(t_2)$. Let $t'$ be the
time when $q_2$ inserted vertex $[q_1, d_1]$ into $G_{q_2}$. Of course, $t' \le t_2$. There are two
cases:

1. $t' < t_2$. By Lemma 25(1), $[q_1, d_1]$ was not in any graph before time $t_1$. Thus,
$t_1 \le t'$ and from the hypothesis of this case, $t_1 < t_2$.

2. $t' = t_2$. Then $q_2$ received a graph containing $[q_1, d_1]$ at $t_2$. Let $t''$ be the time
when this graph was sent. Of course, $t'' < t_2$. By Lemma 25(1), $[q_1, d_1]$ was
not in any graph before $t_1$, and therefore $t_1 \le t''$. Thus, $t_1 < t_2$.

Property 3: Let $[q_1, d_1] \to \cdots \to [q_k, d_k]$ be a path in $G_p(t)$. We must show that there
is an edge $[q_1, d_1] \to [q_k, d_k]$ in $G_p(t)$.

Let $t_i$ be the time when $q_i$ inserted $[q_i, d_i]$ in $G_{q_i}$, for $1 \le i \le k$. By induction on $i$ we
show that $[q_1, d_1] \to \cdots \to [q_i, d_i]$ is a path in $G_{q_i}(t_i)$. The basis, $i = 1$, is trivial. For
the induction step, suppose that $[q_1, d_1] \to \cdots \to [q_{i-1}, d_{i-1}]$ is a path in $G_{q_{i-1}}(t_{i-1})$.
Since $[q_{i-1}, d_{i-1}] \to [q_i, d_i]$ is an edge in $G_p(t)$, by Lemma 25(2), it is also an edge
in $G_{q_i}(t_i)$. Since $[q_{i-1}, d_{i-1}]$ is a vertex in $G_{q_i}(t_i)$, by Lemma 25(3), $G_{q_{i-1}}(t_{i-1})$ is
a subgraph of $G_{q_i}(t_i)$. In particular, $G_{q_i}(t_i)$ contains the path
$[q_1, d_1] \to \cdots \to [q_{i-1}, d_{i-1}]$. Thus, $[q_1, d_1] \to \cdots \to [q_i, d_i]$ is a path in $G_{q_i}(t_i)$, as wanted.

Therefore, the vertices $[q_1, d_1], \ldots, [q_k, d_k]$ are all in $G_{q_k}(t_k)$. At time $t_k$, $q_k$ adds
an edge from every other vertex to $[q_k, d_k]$. Thus, the edge $[q_1, d_1] \to [q_k, d_k]$ is in
$G_{q_k}(t_k)$. By Lemma 25(3), $G_{q_k}(t_k)$ is a subgraph of $G_p(t)$ (since the latter contains
$[q_k, d_k]$). Therefore, $[q_1, d_1] \to [q_k, d_k]$ is in $G_p(t)$, as wanted.

Property 5: Once a vertex or edge is added to $G_p$ it is not removed.

Property 4: Since $p$ is correct, it takes a step at some time $t'$ after $t$. In the failure
detector query phase of this step, $p$ queries its failure detector module and obtains
a value, say $d$. In Line 2 of this step, $p$ adds the vertex $[p, d]$ to $G_p$ and an edge
from all other vertices of $G_p(t')$ to $[p, d]$. From Property 5, $G_p(t)$ is a subgraph of
$G_p(t')$, hence the result follows.

Property 6: Since $p$ is correct, it eventually sends $G_p(t)$ to all processes, including
$q$ (this occurs in $p$'s first execution of Line 4 after time $t$). Since $q$ is correct, it
eventually receives $G_p(t)$, and then replaces $G_q$ with $G_q \cup G_p(t)$, say at time $t'$. So,
$G_p(t)$ is a subgraph of $G_q(t')$. □
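
Properties 3, 5 and 6 can be checked mechanically on a small hand-made schedule. The sketch below is illustrative only: it re-implements one step as a pure function, represents a graph as a pair of frozensets, and adds a sequence number to each vertex to keep repeated detector values apart (an assumption of the sketch). All names are hypothetical.

```python
def step(graph, pid, seq, d, received=None):
    """Return the local graph of `pid` after one step of the communication component."""
    vertices, edges = set(graph[0]), set(graph[1])
    if received is not None:                  # G_p <- G_p U G_q
        vertices |= received[0]
        edges |= received[1]
    v = (pid, d, seq)
    edges |= {(u, v) for u in vertices}       # edges from all other vertices to v
    vertices.add(v)
    return frozenset(vertices), frozenset(edges)


def is_transitively_closed(edges):
    """True if every two-edge path a -> b -> c is matched by an edge a -> c."""
    return all((a, c) in edges
               for (a, b) in edges
               for (b2, c) in edges
               if b == b2)


def is_subgraph(g, h):
    return g[0] <= h[0] and g[1] <= h[1]


EMPTY = (frozenset(), frozenset())
gp1 = step(EMPTY, "p", 1, "a")                  # p's first query
gq1 = step(EMPTY, "q", 1, "b")                  # q's first query
gp2 = step(gp1, "p", 2, "a", received=gq1)      # p receives q's first graph
gq2 = step(gq1, "q", 2, "c", received=gp1)      # q receives p's first graph
gp3 = step(gp2, "p", 3, "a", received=gq2)      # p receives q's second graph

print(is_transitively_closed(gp3[1]), is_subgraph(gq2, gp3))
```

In this run each message carries the sender's whole graph, which is what makes the transitive-closure argument of Property 3 go through.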

Property 5 of the above lemma allows us to define $G_p^\infty = \bigcup_{t \in \mathcal{T}} G_p(t)$. From Property 6,
we get:

Lemma 27: For any correct processes $p$ and $q$, $G_p^\infty = G_q^\infty$.

Proof: Let $o$ be any vertex or edge of $G_p^\infty$, i.e., there is a time $t$ at which $o$ is in $G_p(t)$.
From Lemma 26(6), there is a time $t'$ such that $G_p(t)$ is a subgraph of $G_q(t')$. Thus $o$
is in $G_q^\infty$. Thus $G_p^\infty$ is a subgraph of $G_q^\infty$. By a symmetric argument, $G_q^\infty$ is a subgraph
of $G_p^\infty$, hence $G_p^\infty = G_q^\infty$. □

Lemma 27 allows us to define the limit graph $G$ to be $G_p^\infty$ for any correct process $p$. The
first four properties of Lemma 26 immediately imply:

Lemma  28:  The  limit  graph  G  satisfies  the  four  properties  of  the  DAG  defined  in 
Section  6.1. 

As before, $\Upsilon^i$ denotes the tagged simulation tree induced by $G$ and initial configuration
$I^i$, and $\Upsilon$ denotes the tagged simulation forest $\{\Upsilon^0, \Upsilon^1, \ldots, \Upsilon^n\}$.

6.5.2 The computation component

Since the limit graph $G$ has the four properties of the DAG, we can apply the “centralized”
selection method of Figure 9 to identify a correct process. This method involved:

• Constructing and tagging the infinite simulation forest $\Upsilon$ induced by $G$.

• Applying a rule to $\Upsilon$ to select a particular correct process $p^*$.

In the computation component of $T_{\mathcal{D} \to \Omega}$, each $p$ approximates the above method by
repeatedly:

• Constructing and tagging the finite simulation forest $\Upsilon_p$ induced by $G_p$, its present
finite approximation of $G$.

• Applying the same rule to $\Upsilon_p$ to select a particular process.

Since the limit of $\Upsilon_p$ over time is $\Upsilon$, and the information necessary to select $p^*$ is in a
finite subgraph of $\Upsilon$, we can show that eventually $p$ will keep selecting the correct process
$p^*$, forever.

Actually, $p$ cannot quite use the tagging method of Figure 9: that method requires
knowing which processes are correct! Instead, $p$ assigns tag $k$ to a vertex $S$ in $\Upsilon_p$ if and
only if $S$ has a descendent $S'$ such that $p$ itself has decided $k$ in $S'(I^i)$. If $p$ is correct,
this is eventually equivalent to the tagging method of Figure 9. If $p$ crashes, we do not
care how it tags its forest. Also, $p$ cannot use exactly the same selection method as that
of Figure 9: its current simulation forest $\Upsilon_p$ may not yet have a critical index or contain
any decision gadget (although it eventually will!). In that case, $p$ temporizes by just
selecting itself. The computation component of $T_{\mathcal{D} \to \Omega}$ is shown in Figure 11 (compare it
with the selection method of Figure 9).
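
The selection rule with its two "temporizing" fallbacks can be sketched as follows. This is only an illustration under simplifying assumptions: processes are identified by their index, the critical indices and the decision gadgets visible in the current finite forest are taken as already computed, and each gadget is represented by a hypothetical pair (encoding, deciding process). The function name `select` is made up for the example.

```python
from typing import Dict, List, Tuple


def select(p: int,
           critical: Dict[int, str],
           gadgets: Dict[int, List[Tuple[int, int]]]) -> int:
    """Selection rule of process p on its current finite forest.

    critical: critical indices found so far, each mapped to "monovalent"
              or "bivalent".
    gadgets:  for each tree index, (encoding, deciding_process) pairs of the
              decision gadgets currently visible in that tree.
    """
    if not critical:
        return p                       # no critical index yet: temporize
    i = min(critical)                  # smallest critical index
    if critical[i] == "monovalent":
        return i                       # process p_i
    if not gadgets.get(i):
        return p                       # no decision gadget yet: temporize
    return min(gadgets[i])[1]          # decider of the smallest gadget


print(select(3, {}, {}))
print(select(3, {2: "bivalent"}, {}))
print(select(3, {2: "bivalent"}, {2: [(114, 5), (6, 7)]}))
```

Early calls fall back to $p$ itself; once the forest is rich enough, the rule behaves like the centralized one.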

We first show that $\Upsilon_p$, the simulation forest that $p$ constructs, is indeed an increasingly
accurate approximation of $\Upsilon$ (Lemma 29). We then show that the tags that $p$ gives
to any vertex $S$ in $\Upsilon_p$ are eventually the same ones that the tagging rule of Figure 9 gives
to $S$ in $\Upsilon$ (Lemma 30). Let $\Upsilon_p(t)$ denote $\Upsilon_p$ at time $t$, i.e., $\Upsilon_p(t)$ is the finite simulation
forest induced by $G_p(t)$.

Lemma 29: For any correct $p$ and any time $t$:

1. $\Upsilon_p(t)$ is a subgraph¹⁰ of $\Upsilon$.

2. $\Upsilon_p(t)$ is a subgraph of $\Upsilon_p(t')$, for all $t' > t$.

3. $\bigcup_{t \in \mathcal{T}} \Upsilon_p(t) = \Upsilon$.

¹⁰ The subgraph and graph equality relations ignore the tags.

{Build and tag simulation forest $\Upsilon_p$ induced by $G_p$}
for $i \leftarrow 0, 1, \ldots, n$:
    $\Upsilon_p^i \leftarrow$ simulation tree induced by $G_p$ and $I^i$
    for every vertex $S$ of $\Upsilon_p^i$
        if $S$ has a descendent $S'$ such that $p$ has decided $k$ in $S'(I^i)$
        then add tag $k$ to $S$

{Select a process from tagged simulation forest $\Upsilon_p$}
if there is no critical index then return $p$
else
    $i \leftarrow$ smallest critical index    (1)
    if $i$ is monovalent critical then return $p_i$    (2)
    else if $\Upsilon_p^i$ has no decision gadgets then return $p$
    else return deciding process of the smallest decision gadget in $\Upsilon_p^i$    (3)

Figure 11: Process $p$'s computation component


Proof:

Property 1: Let $S$ be any vertex of tree $\Upsilon_p^i(t)$ (for some $i$, $0 \le i \le n$). From the
definition of $\Upsilon_p^i(t)$, $S$ is compatible with some path $g$ of $G_p(t)$ and applicable to
$I^i$. Since $G_p(t)$ is a subgraph of $G$, $g$ is also a path of $G$. Thus, $S$ is compatible
with a path of $G$; since it is also applicable to $I^i$, it is a vertex of $\Upsilon^i$.

Similarly, let $S \to S'$ be an edge $e$ of $\Upsilon_p^i(t)$. Since $S$ and $S'$ are also vertices of $\Upsilon^i$,
and $S' = S \cdot e$, $S \to S'$ is also an edge of $\Upsilon^i$.

Property 2: Follows from Lemma 26(5).

Property 3: We first show that $\Upsilon$ is a subgraph of $\bigcup_{t \in \mathcal{T}} \Upsilon_p(t)$. Let $S$ be any vertex of
any tree $\Upsilon^i$ of $\Upsilon$. From the definition of $\Upsilon^i$, $S$ is compatible with some finite path
$g$ of $G$ and is applicable to $I^i$. Since $G = \bigcup_{t \in \mathcal{T}} G_p(t)$ and $g$ is a finite path of $G$,
there is a time $t$ such that $g$ is also a path of $G_p(t)$. Since $S$ is compatible with $g$
of $G_p(t)$ and is applicable to $I^i$, $S$ is a vertex of $\Upsilon_p^i(t)$.

Let $S \to S'$ be any edge $e$ of $\Upsilon^i$. By the argument above, there is a time $t$ after
which both $S$ and $S'$ are vertices of $\Upsilon_p^i$. Since $S' = S \cdot e$, after time $t$ the edge $e$ is
also in $\Upsilon_p^i$. Thus, every vertex and every edge of $\Upsilon$ is also in $\bigcup_{t \in \mathcal{T}} \Upsilon_p(t)$, i.e., $\Upsilon$ is
a subgraph of $\bigcup_{t \in \mathcal{T}} \Upsilon_p(t)$.

By Property 1, $\bigcup_{t \in \mathcal{T}} \Upsilon_p(t) = \Upsilon$. □



Lemma 30: Let $p$ be any correct process, and $S$ be any vertex of $\Upsilon_p$. There is a time
after which the tags of $S$ in $\Upsilon_p$ are the same as the tags of $S$ in $\Upsilon$.

Proof: Suppose that at some time $t$, $p$ assigns tag $k$ to vertex $S$ of tree $\Upsilon_p^i$. This means
that $S$ has a descendent $S'$ in $\Upsilon_p^i(t)$ such that $p$ has decided $k$ in $S'(I^i)$. By Lemma 29(1),
$S'$ is also a descendent of $S$ in $\Upsilon^i$, and since $p$ is correct, $S$ has tag $k$ in $\Upsilon^i$ as well.

Conversely, suppose a vertex $S$ of a tree $\Upsilon^i$ of $\Upsilon$ has tag $k$. We show that, eventually,
$p$ also assigns tag $k$ to $S$ in $\Upsilon_p^i$. Since $S$ has tag $k$ in $\Upsilon^i$, $S$ has a descendent $S'$ in $\Upsilon^i$
such that some correct process has decided $k$ in $S'(I^i)$ (cf. tagging rule in Figure 9).
By Lemma 12(1), there is a descendent $S''$ of $S'$ in $\Upsilon^i$, such that all correct processes,
including $p$, have decided in $S''(I^i)$. By Lemma 7, $S''(I^i)$ is a configuration of a partial
run of $\mathit{Consensus}_{\mathcal{D}}$. By the Agreement property of Consensus, $p$ must have decided $k$ in
$S''(I^i)$. Consider the path that starts from the root of $\Upsilon^i$ and goes to vertex $S$ and then
to $S''$. By Lemma 29(3), there is a time $t$ after which this path is also in $\Upsilon_p^i$. Therefore,
when $p$ executes the tagging rule of Figure 11 after time $t$, $p$ assigns tag $k$ to $S$ in $\Upsilon_p^i$
(because $p$ has decided $k$ in $S''(I^i)$, and $S''$ is a descendent of $S$ in $\Upsilon_p^i$). □
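
The tagging rule of Figure 11 and the way tags grow with the forest can be illustrated on a toy tree. The sketch below is not the authors' construction: it takes the tree as a plain child map, takes $p$'s decision in each configuration from a hypothetical `decided` table, and treats a vertex as its own descendent (an assumption of the sketch).

```python
from typing import Dict, Hashable, List, Optional, Set


def compute_tags(children: Dict[Hashable, List[Hashable]],
                 decided: Dict[Hashable, Optional[int]],
                 root: Hashable) -> Dict[Hashable, Set[int]]:
    """Tag each vertex S with every k that p has decided in S or below S."""
    tags: Dict[Hashable, Set[int]] = {}

    def visit(s: Hashable) -> Set[int]:
        found: Set[int] = set()
        if decided.get(s) is not None:
            found.add(decided[s])
        for child in children.get(s, []):
            found |= visit(child)
        tags[s] = found
        return found

    visit(root)
    return tags


decided = {"a1": 0, "b": 1}                       # p's decisions, by vertex
early = compute_tags({"r": ["a"], "a": ["a1"]}, decided, "r")
later = compute_tags({"r": ["a", "b"], "a": ["a1"]}, decided, "r")
print(early["r"], later["r"])
```

In the smaller tree the root carries only tag 0; once the branch through `b` appears it carries both tags. Tags are only ever added as the tree grows, which matches the first half of the proof above.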

Recall that $p^*$ is the correct process obtained by applying the selection rule of Figure 9
to the infinite simulation forest $\Upsilon$. We now show that there is a time after which any
correct $p$ always selects $p^*$ when it applies the corresponding selection rule of Figure 11 to
its own finite approximation of the simulation forest $\Upsilon_p$. Roughly speaking, the reason
is as follows. By Lemma 30, there is a time $t$ after which the tags of all the roots in
$p$'s forest $\Upsilon_p$ are the same as in the infinite forest $\Upsilon$. Since these tags determine the
sets of monovalent and bivalent critical indices, after time $t$ these sets according to $p$ are
the same as in $\Upsilon$. Let $i$ be the minimum critical index in these sets, and consider the
situation after time $t$. If $i$ is monovalent critical, the selection rule of Figure 11 selects $p_i$,
which is what $p^*$ is in this case. If $i$ is bivalent critical, then $p$ selects the deciding process
of its current minimum decision gadget of $\Upsilon_p^i$ (if it has one). This case is examined below.

Let $\gamma^*$ be the minimum decision gadget of $\Upsilon^i$ (so, $p^*$ is the deciding process of $\gamma^*$).
For a while, $\gamma^*$ may not be the minimum decision gadget of $\Upsilon_p^i$. This may be because
$\gamma^*$ (and its tags) is not yet in $\Upsilon_p^i$. However, by Lemmata 29(3) and 30, $\gamma^*$ (including its
tags) will eventually be in $\Upsilon_p^i$. Alternatively, it may be because $\Upsilon_p^i$ contains a subgraph
$\gamma$ whose encoding is smaller than $\gamma^*$'s, and for a while $\gamma$ looks like a decision gadget
according to its present tags. However, by Lemma 30, $p$ will eventually determine all
the tags of $\gamma$, and discover that $\gamma$ is not really a decision gadget. Since there are only
finitely many graphs whose encoding is smaller than $\gamma^*$'s, $p$ will eventually discard all
the “fake” decision gadgets (like $\gamma$) that are smaller than $\gamma^*$, and then select $\gamma^*$ as its
minimum decision gadget. After that time, $p$ always selects the deciding process of $\gamma^*$,
which is precisely $p^*$ in this case.

Theorem 31: For all correct processes $p$, there is a time after which $output_p = p^*$,
forever.

Proof: Let $i^*$ denote the critical index selected by Line 1 of Figure 9 applied to $\Upsilon$. By
Lemma 30, there is a time $t_{init}$ after which every root of $\Upsilon_p$ has the same tags as the
corresponding root of $\Upsilon$. Thus after time $t_{init}$, $p$ always sets $i = i^*$ in Line 1 of Figure 11.
We now show that there is a time after which the computation component of $p$ (Figure
11) always returns $p^*$. There are two cases:

1. $i^*$ is monovalent critical. In this case, $p^*$ is process $p_{i^*}$ (by Line 2 of the selection
rule of Figure 9). Similarly, after time $t_{init}$: (a) $p$ always sets $i$ to $i^*$ (Line 1 of
Figure 11); (b) $p$ always returns $p_{i^*}$ (Line 2 of Figure 11).

2. $i^*$ is bivalent critical. Let $\gamma^*$ denote the smallest decision gadget of $\Upsilon^{i^*}$. In this case,
$p^*$ is the deciding process of $\gamma^*$. Since $\gamma^*$ is a finite subgraph of $\Upsilon^{i^*}$, by Lemma 29(3),
there is a time after which $\gamma^*$ is also a subgraph of $\Upsilon_p^{i^*}$. By Lemma 30, there is a
time $t_{\gamma^*}$ after which all the (finitely many) vertices of $\gamma^*$ receive the same tags in
$\Upsilon^{i^*}$ and $\Upsilon_p^{i^*}$. Thus after time $t_{\gamma^*}$, $\gamma^*$ is also a decision gadget of $\Upsilon_p^{i^*}$.

Since each graph is encoded as a unique natural number, there are finitely many
graphs with a smaller encoding than $\gamma^*$. Let $Q$ denote the set of graphs with a
smaller encoding than $\gamma^*$, and $\gamma$ be any graph in $Q$. We show that there is a time
after which $\gamma$ is not a decision gadget of $\Upsilon_p^{i^*}$. There are two cases:

(a) $\gamma$ is not a subgraph of $\Upsilon^{i^*}$. In this case, by Lemma 29(1), $\gamma$ is never a subgraph
of $\Upsilon_p^{i^*}$.

(b) $\gamma$ is a subgraph of $\Upsilon^{i^*}$. Since $\gamma^*$ is the smallest decision gadget of $\Upsilon^{i^*}$ and $\gamma$
is smaller than $\gamma^*$, $\gamma$ is not a decision gadget of $\Upsilon^{i^*}$. By Lemma 30, there is
a time $t_\gamma$ after which all the (finitely many) vertices of $\gamma$ have the same tags
in $\Upsilon^{i^*}$ and $\Upsilon_p^{i^*}$. Thus after time $t_\gamma$, $\gamma$ is not a decision gadget of $\Upsilon_p^{i^*}$.

Since $Q$ is finite, there is a time $t_Q$ after which no graph in $Q$ is a decision gadget
of $\Upsilon_p^{i^*}$.

Consider the process that is returned by the computation component of $p$ (Figure 11)
at any time $t > \max(t_{init}, t_{\gamma^*}, t_Q)$. Since $t > t_{init}$, $p$ always sets $i$ to $i^*$ in
Line 1. Since $t > t_{\gamma^*}$, $\gamma^*$ is a decision gadget of $\Upsilon_p^{i^*}(t)$. Finally, since $t > t_Q$, $\gamma^*$ is
the smallest decision gadget of $\Upsilon_p^{i^*}(t)$. Thus, since $i^*$ is bivalent, at any time after
$\max(t_{\gamma^*}, t_Q)$, Line 3 of Figure 11 returns the deciding process of $\gamma^*$. Therefore,
after time $\max(t_{init}, t_{\gamma^*}, t_Q)$, the computation component of $p$ always returns $p^*$.

From the above, there is a time after which $p$ sets $output_p \leftarrow p^*$, forever, in Line 3 of
Figure 10. □

We  now  have  all  the  pieces  needed  to  prove  our  main  result,  Theorem  2  in  Section  5: 

Theorem 2: For all environments $\mathcal{E}$, if a failure detector $\mathcal{D}$ can be used to solve
Consensus in $\mathcal{E}$, then $\mathcal{D} \succeq_{\mathcal{E}} \Omega$.

Proof: Consider the execution of algorithm $T_{\mathcal{D} \to \Omega}$ in any environment $\mathcal{E}$. By Theorem 31,
there is a time after which all correct processes set $output_p = p^*$, forever. By
Theorem 24, $p^*$ is a correct process. Thus, $T_{\mathcal{D} \to \Omega}$ is a reduction algorithm that transforms
$\mathcal{D}$ into $\Omega$. In other words, $\Omega$ is reducible to $\mathcal{D}$. □

7  Discussion 

7.1  Granularity  of  atomic  actions 

Our  model  incorporates  very  strong  assumptions  about  the  atomicity  of  steps.  First, 
the  three  phases  of  each  step  are  assumed  to  occur  indivisibly,  and  at  a  single  time.  In 
particular,  the  failure  of  a  process  cannot  happen  in  the  “middle  of  a  step”.  This  allows 
us to associate a single time $t$ with a step and think of the step as occurring at that
time. Second, in the send phase of a step a message is sent to all processes. Given that
the entire step is indivisible, this means that either all or none of the correct processes
eventually receive the message sent in a step. Finally, no two steps can occur at the
same time.¹¹ These assumptions are convenient because they make the formal model
simpler  to  describe.  Also,  they  are  consistent  with  those  made  in  the  model  of  [FLP85] 
that  provided  the  impetus  for  this  work. 

On  the  other  hand,  in  [CT91]  a  model  with  weaker  properties  is  used.  There,  the 
three  phases  of  a  step  need  not  occur  indivisibly,  and  may  occur  at  different  times.  Even 
within  the  send  phase,  the  messages  sent  to  the  different  processes  may  be  sent  at  different 
times.  Thus,  a  failure  may  occur  in  the  middle  of  the  send  phase,  resulting  in  some  correct 
processes  eventually  receiving  the  messages  sent  to  them  in  that  step  while  others  never 
do.  Also,  actions  of  different  processes  may  take  place  simultaneously,  subject  to  the 
restriction  that  a  message  can  only  be  received  strictly  after  it  was  sent.  Since  [CT91]  is 
mainly  concerned  with  showing  how  to  use  various  types  of  failure  detectors  to  achieve 
Consensus,  the  use  of  a  weaker  model  strengthens  the  results.  (In  fact,  the  negative 
results  of  [CT91]  hold  even  in  the  model  of  this  paper,  with  the  stronger  atomicity 
assumptions.) 

The question naturally arises whether our result also applies to this weaker model.
In other words, if a failure detector $\mathcal{D}$ can be used to solve Consensus in the weak model,
is it true that we can transform $\mathcal{D}$ to $\Diamond\mathcal{W}$ in that model? It turns out that the answer is
affirmative. To see this, first note that if $\mathcal{D}$ solves Consensus in the weak model then it
surely solves Consensus in the strong model. By our result, $\mathcal{D}$ can be transformed to $\Diamond\mathcal{W}$
in the strong model. It remains to show that $\mathcal{D}$ can be transformed to $\Diamond\mathcal{W}$ in the weak
model. This is not obvious, since it is conceivable that the extra properties of the strong
model are crucial in the transformation of $\mathcal{D}$ to $\Diamond\mathcal{W}$. Fortunately, the transformation
presented in this paper actually works even in the weak model!

To see this, it is sufficient to make sure that the communication component of the
transformation (cf. Figure 10 in Section 6.5.1) constructs graphs that satisfy the properties
listed in Lemma 26, even if we run it in the weak model. It is not difficult to verify
that this is indeed so. The proof is virtually the same, except for the fact that we must
distinguish the time $t$ in which a process $p$ queries its failure detector and the time $t'$ in
which $p$ adds the value it saw into $G_p$. In our proof we assume that $t = t'$; in the weak
model we would have $t \le t'$. Similar comments apply to all actions within a step that are
no longer assumed to occur at the same instant of time. These changes make the proofs
slightly more cumbersome, since we must introduce notation for all the different times
in which relevant actions within a step take place, but the reasoning remains essentially
the same.¹²

¹¹ This is reflected in our formal model by the fact that the list of times in a run (which indicate when
the events in the run's schedule occur) is increasing.

Thus, our result is not merely a fortuitous consequence of some whimsical choice of
model. We view the robustness of the result across different models of asynchrony as
further testimony to the significance of the failure detector $\Diamond\mathcal{W}$.

7.2  Failure  detection  and  partial  synchrony 

The  fundamental  reason  why  Consensus  cannot  be  solved  in  completely  asynchronous 
systems is the fact that, in such systems, it is impossible to reliably distinguish a process
that has crashed from one that is merely very slow. In other words, Consensus
is  unsolvable  because  accurate  failure  detection  is  impossible.  On  the  other  hand,  it 
is  well-known  that  Consensus  is  solvable  (deterministically)  in  completely  synchronous 
systems  —  that  is,  systems  where  all  processes  take  steps  at  the  same  rate  and  each 
message  arrives  at  its  destination  a  fixed  and  known  amount  of  time  after  it  is  sent.  In 
such  a  system  we  can  use  timeouts  to  implement  a  “perfect”  failure  detector  —  i.e.,  one 
in  which  no  process  is  ever  wrongly  suspected,  and  every  faulty  process  is  eventually 
suspected.  Thus  the  ability  to  solve  Consensus  in  a  given  system  is  intimately  related  to 
the  failure  detection  capabilities  of  that  system.  This  realization  led  to  the  extension  of 
the  asynchronous  model  of  computation  with  failure  detectors  in  [CT91].  In  that  paper 
Consensus  is  shown  to  be  solvable  even  with  very  weak  failure  detectors  that  could  make 
an  infinite  number  of  “mistakes”. 
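
To make the timeout idea concrete, here is a minimal sketch of a timeout-based detector for a completely synchronous system. It rests on assumptions that are not spelled out in the text: every process sends a heartbeat every `period` time units starting at time 0, every message is delivered within a known `max_delay`, and the caller supplies the current time. The class `TimeoutDetector` and its methods are hypothetical names.

```python
from typing import Dict, Iterable, Set


class TimeoutDetector:
    """Suspects q once no heartbeat from q arrived for period + max_delay."""

    def __init__(self, processes: Iterable[str], period: float, max_delay: float) -> None:
        self.deadline = period + max_delay
        # Assumes every process sends its first heartbeat at time 0.
        self.last_heard: Dict[str, float] = {q: 0.0 for q in processes}

    def on_heartbeat(self, q: str, now: float) -> None:
        self.last_heard[q] = now

    def suspects(self, now: float) -> Set[str]:
        return {q for q, t in self.last_heard.items() if now - t > self.deadline}


fd = TimeoutDetector(["q", "r"], period=1.0, max_delay=0.5)
fd.on_heartbeat("q", 1.2)
fd.on_heartbeat("r", 1.1)
fd.on_heartbeat("q", 2.3)
print(fd.suspects(2.5), fd.suspects(3.0))
```

Under these assumptions two consecutive heartbeats of a live process arrive at most `period + max_delay` apart, so a process is suspected only if it has crashed, and a crashed process is eventually suspected. Without the known bound, the same code could wrongly suspect a slow process, which is the situation in an asynchronous system.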

A  different  tack  on  circumventing  the  unsolvability  of  Consensus  is  pursued  in  [DDS87] 
and  [DLS88].  The  approach  of  those  papers  is  based  on  the  observation  that  between 
the  completely  synchronous  and  completely  asynchronous  models  of  distributed  systems 
there  lie  a  variety  of  intermediate  “partially  synchronous”  models.  For  instance,  in  one 
model  of  partial  synchrony,  processes  take  steps  at  the  same  rate,  but  message  delays 
are  unbounded  (albeit  finite).  Alternatively,  it  may  be  known  that  message  delays  are 
bounded,  but  the  actual  bound  may  be  unknown.  In  yet  another  variation,  the  eventual 
maximum  message  delay  is  known,  but  during  some  initial  period  of  finite  but  unknown 
duration some messages may experience longer delays. These and many other models of
partial synchrony are studied in [DDS87] and [DLS88], and the question of solvability of
Consensus in each of them is answered either positively or negatively.

12 Another problem that must be confronted is that in the proofs of Lemmata 25 and 26 we often refer
to the “first graph” in which a vertex or edge is present. In the strong model there is no difficulty with
this, since processes cannot execute steps simultaneously. In the weak model, we have to justify that it
makes sense to speak of the “first” graph to contain a vertex or edge, in spite of the fact that certain
actions can be executed at the same time. The fact that a message can be received only after it was
sent is needed here.

In  particular,  [DDS87]  defines  a  space  of  32  models  by  considering  five  key  parameters, 
each  of  which  admits  a  “favourable”  and  an  “unfavourable”  setting.  For  instance,  one 
of  the  parameters  is  whether  the  maximum  message  delay  is  known  (favourable  setting) 
or  not  (unfavourable  setting).  Each  of  the  32  models  corresponds  to  a  particular  setting 
of  the  5  parameters.  [DDS87]  identifies  four  “minimal”  models  in  which  Consensus  is 
solvable.  These  are  minimal  in  the  sense  that  the  weakening  of  any  parameter  from 
favourable  to  unfavourable  would  yield  a  model  of  partial  synchrony  where  Consensus 
is  unsolvable.  Thus,  within  the  space  of  the  models  considered,  [DDS87]  and  [DLS88] 
delineate  precisely  the  boundary  between  solvability  and  unsolvability  of  Consensus,  and 
provide  an  answer  to  the  question  “What  is  the  least  amount  of  synchrony  sufficient  to 
solve  Consensus?”. 

Failure  detectors  can  be  viewed  as  a  more  abstract  and  modular  way  of  incorporating 
partial  synchrony  assumptions  into  the  model  of  computation.  Instead  of  focusing  on 
the  operational  features  of  partial  synchrony  (such  as  the  five  parameters  considered 
in  [DDS87]),  we  can  consider  the  axiomatic  properties  that  failure  detectors  must  have 
in  order  to  solve  Consensus.  The  problem  of  implementing  a  given  failure  detector  in 
a  specific  model  of  partial  synchrony  becomes  a  separate  issue;  this  separation  affords 
greater  modularity. 

To  see  the  connection  between  partial  synchrony  and  failure  detectors,  it  is  useful  to 
examine  how  one  might  go  about  implementing  a  failure  detector.  By  the  impossibility 
result  of  [FLP85],  a  failure  detector  that  can  be  used  to  solve  Consensus  cannot  be 
implemented  in  a  completely  asynchronous  system.  Now  consider  partially  synchronous 
systems  in  which  correct  processes  have  accurate  timers  (i.e.,  they  can  measure  elapsed 
time).  If  in  such  a  system  message  delays  are  bounded  and  the  maximum  delay  is  known, 
we  can  use  timeouts  to  implement  the  “perfect”  failure  detector  described  above.  In  a 
weaker  system  where  message  delays  are  bounded  but  the  maximum  delay  is  not  known, 
we  can  implement  a  failure  detector  satisfying  a  weaker  property:  eventually  no  correct 
process  is  suspected.  This  can  be  done  by  using  timeouts  of  increasing  length;  once  the 
timeout  period  has  been  increased  sufficiently  to  exceed  the  unknown  maximum  delay, 
no  correct  process  will  be  suspected.  A  failure  detector  with  the  same  property  can  also 
be implemented in a distributed system where the eventual maximum message delay
is  known,  but  messages  may  be  delayed  for  longer  during  some  initial  period  of  finite 
but  unknown  duration.  With  these  remarks  we  wish  to  illustrate  two  points:  First, 
that  stronger  failure  detectors  correspond  to  stronger  models  of  partial  synchrony;  and 
second,  that  the  same  failure  detector  can  be  implemented  in  different  models  of  partial 
synchrony. 
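The increasing-timeout construction described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the construction of [CT91]: the class name `AdaptiveTimeoutDetector`, the heartbeat messages, the integer clock, and the arrival schedule are all hypothetical. Process `q` is correct and its heartbeats arrive every 4 time units (a bound the detector does not know); process `r` crashes after its first heartbeat.

```python
class AdaptiveTimeoutDetector:
    """Suspects a process when no heartbeat arrived for more than its timeout."""

    def __init__(self, processes, initial_timeout=1, increment=1):
        self.timeout = {q: initial_timeout for q in processes}
        self.increment = increment
        self.last_heard = {q: 0 for q in processes}
        self.suspected = set()

    def on_heartbeat(self, q, now):
        # A heartbeat from a suspected process reveals a mistake:
        # retract the suspicion and lengthen the timeout used for q.
        if q in self.suspected:
            self.suspected.discard(q)
            self.timeout[q] += self.increment
        self.last_heard[q] = now

    def query(self, now):
        # Add every process whose silence exceeds its current timeout.
        for q, heard in self.last_heard.items():
            if now - heard > self.timeout[q]:
                self.suspected.add(q)
        return set(self.suspected)


def simulate(horizon=40):
    fd = AdaptiveTimeoutDetector(["q", "r"])
    # Hypothetical arrival times of heartbeats at the detector.
    arrivals = {"q": set(range(4, horizon + 1, 4)), "r": {4}}
    q_suspected_at = []
    for now in range(1, horizon + 1):
        for proc, times in arrivals.items():
            if now in times:
                fd.on_heartbeat(proc, now)  # deliveries precede the query
        if "q" in fd.query(now):
            q_suspected_at.append(now)
    return fd, q_suspected_at


if __name__ == "__main__":
    fd, q_suspected_at = simulate()
    print("q wrongly suspected at times:", q_suspected_at)
    print("final suspects:", fd.query(40))
```

In this run the correct process `q` is wrongly suspected only finitely often (at times 2, 3 and 7), after which its timeout has grown enough that it is never suspected again, while the crashed process `r` ends up permanently suspected.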

Studying  failure  detectors  rather  than  various  models  of  partial  synchrony  has  several 
advantages.  By  determining  whether  Consensus  is  solvable  using  some  specific  failure 
detector  we  thereby  determine  whether  Consensus  is  solvable  in  all  systems  in  which 
that  failure  detector  can  be  implemented.  An  algorithm  that  relies  on  the  axiomatic 
properties of a given failure detector is more general, more modular, and simpler to
understand  than  one  that  relies  directly  on  some  specific  operational  features  of  partial 
synchrony  (that  can  be  used  to  implement  the  given  failure  detector). 

From  this  more  abstract  point  of  view,  the  question  “What  is  the  least  amount  of 
synchrony  sufficient  to  solve  Consensus?”  translates  to  “What  is  the  weakest  failure 
detector  sufficient  to  solve  Consensus?”.  In  contrast  to  [DDS87],  which  identified  a  set 
of minimal models of partial synchrony in which Consensus is solvable, we are able to
exhibit a single minimum failure detector that can be used to solve Consensus. The
technical device that made this possible is the notion of reduction between failure
detectors. We suspect that a corresponding notion of reduction between models of partial
synchrony,  although  possible,  would  be  more  complex.  This  is  because  there  are  models 
which  are  not  comparable  in  general  (in  the  sense  that  there  are  tasks  that  are  possible 
in  one  but  not  in  the  other  and  vice  versa),  although  they  are  comparable  as  far  as 
failure  detection  is  concerned  —  which  is  all  that  matters  for  solving  Consensus!  In  this 
connection,  it  is  useful  to  recall  our  earlier  observation,  that  the  same  failure  detector 
can  be  implemented  in  different  (indeed,  incomparable)  models  of  partial  synchrony. 

7.3  Weak  Consensus 

[FLP85] actually showed that even the Weak Consensus problem cannot be solved
(deterministically) in an asynchronous system. Weak Consensus is like Consensus except
that the validity property is replaced by the following, weaker, property:

Non-triviality:  There  is  a  run  of  the  protocol  in  which  correct  processes  decide  0,  and 
a  run  in  which  correct  processes  decide  1. 

Unlike  validity,  this  property  does  not  explicitly  prescribe  conditions  under  which  the 
correct  processes  must  decide  0  or  1  —  it  merely  states  that  it  is  possible  for  them  to 
reach  each  of  these  decisions.  It  is  natural  to  ask  whether  our  result  holds  for  this  weaker 
problem as well. In fact, it is easy to modify our proof to show the following:

Theorem: For all environments $\mathcal{E}$, if a failure detector $\mathcal{D}$ can be used to solve Weak
Consensus in $\mathcal{E}$, then $\mathcal{D} \succeq_{\mathcal{E}} \Omega$.

We  briefly  sketch  the  modifications  of  the  proof  needed  to  obtain  this  strengthening  of 
Theorem  2.  The  only  use  of  the  validity  property  is  in  the  proof  of  Lemma  18  which 
states that the root of $\Upsilon^0$ is 0-valent and the root of $\Upsilon^n$ is 1-valent. This, in turn, is
used  in  the  proof  of  Lemma  19,  which  states  that  a  critical  index  exists. 

To prove the stronger theorem, we concentrate on the forest induced by all initial
configurations, not just $I^0, \ldots, I^n$. Thus, the forest now will have $2^n$ trees, rather than
only $n + 1$. Consider the $n$ initial values of processes in an initial configuration as an
$n$-bit vector, and fix any $n$-bit Gray code.$^{13}$ Let $I^0, \ldots, I^{2^n - 1}$ be the initial configurations
listed in the order specified by the Gray code, and let $\Upsilon^i$ be the tree induced by $I^i$, for
$i = 0, \ldots, 2^n - 1$. We use the same definition for a critical index as we had before: Index
$i = 0, 1, \ldots, 2^n - 1$ is critical if the root of $\Upsilon^i$ is bivalent or the root of $\Upsilon^i$ is 1-valent
while the root of $\Upsilon^{i-1}$ is 0-valent. The only difference is that we now take subtraction to be
modulo $2^n$, so that when $i = 0$, $i - 1 = -1 = 2^n - 1$. We can now prove an analogue to Lemma 19.

13 An $n$-bit Gray code is a sequence of all possible $n$-bit vectors where successive vectors, as well as
the first and last vectors, differ only in the value of one position. It is well-known that such codes exist
for all $n \geq 1$.
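As an illustration of the ordering used here, the following sketch generates one particular Gray code, the standard binary-reflected one, and checks the cyclic one-position property. The argument only needs some Gray code to be fixed; the function names are hypothetical and chosen for this example.

```python
def gray_code(n):
    """Return the n-bit binary-reflected Gray code as a list of bit tuples."""
    codes = []
    for k in range(2 ** n):
        g = k ^ (k >> 1)  # k-th codeword of the reflected Gray code
        codes.append(tuple((g >> b) & 1 for b in reversed(range(n))))
    return codes


def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def is_cyclic_gray(codes):
    """True if all vectors are distinct and cyclically adjacent ones differ in one position."""
    m = len(codes)
    return len(set(codes)) == m and all(
        hamming(codes[i], codes[(i + 1) % m]) == 1 for i in range(m)
    )


if __name__ == "__main__":
    print(gray_code(2))
    print(all(is_cyclic_gray(gray_code(n)) for n in range(1, 9)))
```

Because the first and last vectors also differ in exactly one position, consecutive initial configurations in this order, taken cyclically, differ only in the initial value of a single process.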

Lemma: There is a critical index $i$, $0 \leq i \leq 2^n - 1$.

Proof: If the root of some $\Upsilon^i$ is bivalent then we are done. Otherwise, the root of each
tree in the forest is monovalent. By Non-triviality (the weakened validity property), there
exist $0 \leq i, j \leq 2^n - 1$ so that the root of $\Upsilon^i$ is 0-valent and the root of $\Upsilon^j$ is 1-valent.
By considering the sequence $\Upsilon^i, \Upsilon^{i+1}, \ldots, \Upsilon^j$ (where addition is modulo $2^n$) it is easy
to see that the root of some $\Upsilon^k$, $k \neq i$, that appears in that sequence is 1-valent, while the
root of $\Upsilon^{k-1}$ is 0-valent. By definition, $k$ is a critical index. □

The  rest  of  the  proof  remains  unchanged. 
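The counting argument in the proof of the Lemma can be checked mechanically on small cases. The sketch below is an illustration only: it abstracts each tree to the valency of its root, encoded as `0`, `1`, or the string `"bi"` (an encoding assumed for this example), and searches the cyclic sequence for a critical index.

```python
import itertools


def find_critical_index(valency):
    """Return a critical index of a cyclic sequence of root valencies, or None.

    valency[i] is 0, 1, or "bi". Index i is critical if valency[i] is "bi",
    or if valency[i] is 1 while valency[i - 1] is 0 (subtraction modulo the length).
    """
    m = len(valency)
    for i in range(m):
        if valency[i] == "bi":
            return i
    for i in range(m):
        if valency[i] == 1 and valency[(i - 1) % m] == 0:
            return i
    return None  # every root has the same valency


def check_all(m):
    """Exhaustively confirm the claim for monovalent sequences of length m."""
    for bits in itertools.product((0, 1), repeat=m):
        if 0 in bits and 1 in bits:  # both decisions occur somewhere
            k = find_critical_index(bits)
            assert k is not None
            assert bits[k] == 1 and bits[(k - 1) % m] == 0


if __name__ == "__main__":
    print(find_critical_index([1, 1, 0, 0]))  # the wrap-around case i = 0
    check_all(8)
```

The first call shows why subtraction is taken modulo the length of the sequence: in `[1, 1, 0, 0]` the only 0-to-1 transition is from the last position back to position 0.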

7.4  Failure  detectors  with  infinite  range  of  output  values 

The  failure  detectors  in  [RB91,CT91]  only  output  lists  of  processes  suspected  to  have 
crashed.  Since  the  set  of  processes  is  finite,  the  range  of  possible  output  values  of  these 
failure  detectors  is  also  finite.  In  this  paper  our  model  allows  for  failure  detectors  with 
arbitrary  ranges  of  output  values,  including  the  possibility  of  infinite  ranges!  We  illustrate 
the  significance  of  this  generality  by  describing  a  natural  class  of  failure  detectors  whose 
range  of  output  values  is  infinite  (though  each  value  output  is  finite). 

Example:  One  apparent  weakness  with  our  formulation  of  failure  detection  is  that  a 
brief  change  in  the  value  output  by  a  failure  detector  module  may  go  unnoticed.  For 
example, process $p$'s module of the given failure detector, $\mathcal{D}$, may output $d_1$ at time
$t_1$, $d_2$ at a later time $t_2$, and $d_1$ again at time $t_3$ after $t_2$. If, due to the asynchrony of
the system, $p$ does not take a step between time $t_1$ and $t_3$, $p$ may never notice that its
failure detector module briefly output $d_2$. A natural way of overcoming this problem
is to replace $\mathcal{D}$ with a failure detector $\mathcal{D}'$ that has the following property: $\mathcal{D}'$ maintains
the same list of suspects as $\mathcal{D}$ but, when queried, $\mathcal{D}'$ returns the entire history of its list
of suspects up to the present time. In this manner, correct processes are guaranteed
to notice every change in $\mathcal{D}$'s list of suspects. As the system continues executing, the
values output by $\mathcal{D}'$ grow in size. This means that $\mathcal{D}'$ has an infinite range of output
values.

However, since $\mathcal{D}$ is a function of $F$, the failure pattern encountered, $\mathcal{D}'$ is also a
function of $F$, and can be described by our model. Thus, the result in this paper applies
to $\mathcal{D}'$, a natural failure detector with infinite range of output values.
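The history-returning detector of the Example can be sketched as a wrapper around a base detector. This is a toy model under stated assumptions: time is discrete, the base detector's output is given by a fixed schedule (standing in for a function of the failure pattern), and the class names `SuspectListDetector` and `HistoryDetector` are hypothetical.

```python
class SuspectListDetector:
    """Stand-in for the base detector: outputs a set of suspects at each time."""

    def __init__(self, schedule):
        # schedule: list of (start_time, suspects) pairs, sorted by start_time.
        self.schedule = schedule

    def output(self, t):
        current = frozenset()
        for start, suspects in self.schedule:
            if start <= t:
                current = frozenset(suspects)
        return current


class HistoryDetector:
    """Stand-in for the wrapper: outputs the base detector's whole history up to t."""

    def __init__(self, base):
        self.base = base

    def output(self, t):
        # One entry per discrete time 0..t, so the value grows as time passes.
        return tuple(self.base.output(s) for s in range(t + 1))


if __name__ == "__main__":
    # The base detector briefly suspects "q" at time 2 only.
    base = SuspectListDetector([(0, set()), (2, {"q"}), (3, set())])
    hist = HistoryDetector(base)
    steps_of_p = (1, 4)  # p takes no step at time 2
    print([base.output(t) for t in steps_of_p])  # the brief suspicion is missed
    print(hist.output(4))                         # the history reveals it
```

A process that queries only at times 1 and 4 never sees the brief suspicion through the base detector, but sees it in the history returned at time 4. The length of the returned tuple is unbounded over an infinite run, which is the sense in which the range of output values is infinite although each value is finite.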